How to make a count matrix of common elements across many groups? - r

I'm trying to identify common elements across multiple vectors, with all combinations possible.
I had previously tried this one here, but it doesn't quite work out because it only retrieves the common elements between 2 groups.
Take this example: I have 10 vectors (varying in number of elements) that may have common elements with one or more other vectors. It is also possible that some elements are exclusive to some groups. As an example, here is the data:
#Creating a mock example: 10 groups, with varying number of elements:
set.seed(753)
for (i in 1:10){
assign(paste0("grp_",i), paste0("target_", sample(1:40, sample(20:34))))
}
Simply put, I want to do something analogous to a Venn diagram, but put into a data frame/matrix with the counts, instead. Something like this (note that here, I am just adding a snapshot of random parts of how the result data frame/matrix should look like):
                                                   grp1 grp2 grp3 grp4 grp1.grp4.grp5.grp8.grp10
grp1                                                  -   16   12   20                         5
grp2                                                 16    -   10   20                         4
grp3                                                 12   10    -   16                         3
grp4                                                 20   20   16    -                         5
grp1.grp4.grp5.grp8.grp10                             5    4    3    5                        10
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10    0    0    0    0                         0
                                                   grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10
grp1                                                                                                3
grp2                                                                                                6
grp3                                                                                                4
grp4                                                                                                1
grp1.grp4.grp5.grp8.grp10                                                                           5
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10                                                  2
From the table above, please also note that counts that have the same row and column names mean that they are exclusive to that particular group (e.g. count on row1/col1 means that there are 88 exclusive elements).
Any help is very much appreciated!
EDIT: the real counts for the expected final matrix have now been added.

OK, if I understood everything correctly, let's give it a try. Note that I put your sample data in a list, so we can index the groups when intersecting them.
set.seed(753)
grps <- list()
for (i in 1:10){
grps[i] <- list(paste0("target_", sample(1:40, sample(20:34))))
}
You want all combinations of the 10 groups. There are 2^10 - 1 = 1023 non-empty combinations, so the result is a 1023 x 1023 matrix.
Making N flexible makes testing a bit easier ;)
The key here is that I keep the combinations as a list of integer vectors that we can use to index grps.
N <- 10
combinations <- unlist(lapply(1:N, function(n) combn(1:N, n, simplify = FALSE)), recursive = FALSE)
Now we have to loop twice over the combinations, because each of the 1023 combinations is compared with each of the 1023 combinations by intersecting their common elements. Using nested sapply calls gives us the 1023 x 1023 matrix you want.
results <- sapply(seq_along(combinations), function(i) {
sapply(seq_along(combinations), function(j) {
length(intersect(
Reduce(intersect, grps[combinations[[i]]]),
Reduce(intersect, grps[combinations[[j]]])
))
})
})
Now we create the names as shown in your example, they are based on the combinations we created and used earlier.
names <- sapply(combinations, function(x) paste("grp", x, sep = "", collapse = "."))
Create the colnames and rownames of the matrix
colnames(results) <- rownames(results) <- names
It seems that in your output you want no values on the diagonal, so we set those to NA
diag(results) <- NA
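The same idea can be checked by hand on a tiny input. The following is a minimal sketch, assuming three made-up groups (the toy vectors and the names common and labels are illustrative, not from the question). It also computes the common elements of each combination once up front, rather than calling Reduce inside the double loop, which may matter when there are 1023 x 1023 cells to fill.

```r
# Toy data (hypothetical): three small groups instead of ten
grps <- list(c("a", "b", "c"), c("b", "c", "d"), c("c", "e"))
N <- length(grps)

# All non-empty combinations of group indices: 2^3 - 1 = 7
combinations <- unlist(
  lapply(seq_len(N), function(n) combn(seq_len(N), n, simplify = FALSE)),
  recursive = FALSE
)

# Elements shared by all groups in each combination, computed once
common <- lapply(combinations, function(idx) Reduce(intersect, grps[idx]))

# Pairwise counts of shared elements between the combinations
results <- sapply(common, function(a) {
  sapply(common, function(b) length(intersect(a, b)))
})

labels <- sapply(combinations, function(x) paste0("grp", x, collapse = "."))
dimnames(results) <- list(labels, labels)
results
```

Here grp1 and grp2 share "b" and "c", so results["grp1", "grp2"] is 2, and only "c" is common to all three groups.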

Related

Why does the subsetting of a data.frame() in R behave differently when it has one contrary to multiple columns? [duplicate]

Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 10 20 30
I have tried to use as.data.frame(), which does not change the results for two or more columns. It does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 10
2 20
3 30
I don't understand why it behaves like this. In my mind it should not make a difference whether I extract one, two or ten columns. It should either always return a vector (or matrix) or always return a data.frame (with the correct names). What am I missing? Thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
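To see the difference between these forms side by side, here is a small sketch that uses the df from the question and only base R. The variable names (as_vector, as_frame and so on) are just for illustration.

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

as_vector  <- df[, 1]                # matrix-style indexing, dimension dropped
as_frame   <- df[, 1, drop = FALSE]  # matrix-style indexing, dimension kept
as_frame2  <- df[1]                  # list-style indexing, always a data.frame
as_vector2 <- df[["A"]]              # list-style extraction of one element

class(as_vector)   # "numeric"
class(as_frame)    # "data.frame"
names(as_frame)    # "A"
class(as_frame2)   # "data.frame"
class(as_vector2)  # "numeric"
```

Both data.frame results keep the column name A, which is what as.data.frame(df[, 1]) loses.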

Apply a function that requires seq() in R

I am trying to run a summation on each row of a data frame. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
i <- seq(1,x,1)
sum(i^2) }
I then want to apply this function to each row of the data frame, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq does not accept a vector for its from and to arguments, so the function has to be called once per element. We can loop over df$n with sapply:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize
Vectorize(fun)(df$n)
#[1] 1 5 14 30
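For the case where both columns are needed inside the sum, one option is mapply, which walks over several vectors in parallel. This is a minimal sketch: sum_sq is a hypothetical helper name, and the cumsum shortcut only applies because n here runs 1, 2, 3, 4 without gaps.

```r
df <- data.frame(n = 1:4, a = 100)

# Hypothetical helper: sum of a * i^2 for i = 1..n
sum_sq <- function(n, a) sum(a * seq_len(n)^2)

# mapply passes df$n[k] and df$a[k] together for each row k
df$b <- mapply(sum_sq, df$n, df$a)

# Shortcut that is only valid when n is the consecutive sequence 1, 2, 3, ...
df$b_cumsum <- df$a * cumsum(df$n^2)

df
```

If the second value sits inside the summand, as in a^i, only the body of the helper changes, for example sum(a^seq_len(n)).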

How to compute all possible combinations of multiple vectors/matrices of different sizes and sum up columns simultaneously?

Assume I have three matrices...
A=matrix(c("a",1,2),nrow=1,ncol=3)
B=matrix(c("b","c",3,4,5,6),nrow=2,ncol=3)
C=matrix(c("d","e","f",7,8,9,10,11,12),nrow=3,ncol=3)
I want to find all possible combinations of column 1 (characters or names) while summing up columns 2 and 3. The result would be a single matrix with length equal to the total number of possible combinations, in this case 6. The result would look like the following matrix...
Result <- matrix(c("abd","abe","abf","acd","ace","acf",11,12,13,12,13,14,17,18,19,18,19,20),nrow=6,ncol=3)
I do not know how to add a table in to this question, otherwise I would show it more descriptively. Thank you in advance.
You are mixing character and numeric values in a matrix and this will coerce all elements to character. Much better to define your matrix as numeric and keep the character values as the row names:
A <- matrix(c(1,2),nrow=1,dimnames=list("a",NULL))
B <- matrix(c(3,4,5,6),nrow=2,dimnames=list(c("b","c"),NULL))
C <- matrix(c(7,8,9,10,11,12),nrow=3,dimnames=list(c("d","e","f"),NULL))
#put all the matrices in a list
mlist<-list(A,B,C)
Then we use some Map, Reduce and lapply magic:
res <- Reduce("+",Map(function(x,y) y[x,],
expand.grid(lapply(mlist,function(x) seq_len(nrow(x)))),
mlist))
Finally, we build the rownames
rownames(res)<-do.call(paste0,expand.grid(lapply(mlist,rownames)))
# [,1] [,2]
#abd 11 17
#acd 12 18
#abe 12 18
#ace 13 19
#abf 13 19
#acf 14 20
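The one-liner above can be unpacked into named steps. This is a sketch of the same approach with intermediate objects (idx and picked are names chosen for illustration), using drop = FALSE so that each piece stays a matrix.

```r
A <- matrix(c(1, 2), nrow = 1, dimnames = list("a", NULL))
B <- matrix(c(3, 4, 5, 6), nrow = 2, dimnames = list(c("b", "c"), NULL))
C <- matrix(c(7, 8, 9, 10, 11, 12), nrow = 3, dimnames = list(c("d", "e", "f"), NULL))
mlist <- list(A, B, C)

# Step 1: every combination of row indices, one column per matrix (6 rows here)
idx <- expand.grid(lapply(mlist, function(m) seq_len(nrow(m))))

# Step 2: for each matrix, pull out the rows named by its column of idx
picked <- Map(function(i, m) m[i, , drop = FALSE], idx, mlist)

# Step 3: add the three 6 x 2 matrices element by element
res <- Reduce(`+`, picked)

# Step 4: build the labels from the same grid, applied to the row names
rownames(res) <- do.call(
  paste0,
  expand.grid(lapply(mlist, rownames), stringsAsFactors = FALSE)
)
res
```

Because expand.grid varies its first argument fastest, the rows come out in the order abd, acd, abe, ace, abf, acf, matching the output above.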

Get row(s) from data.frame that satisfy a condition composed by an arbitrary amount of sub-conditions in R

I have a data.frame that can contain N columns (N defined at runtime), and I want to get the rows within the data frame that satisfy N-1 conditions. In other words, I want to get only the rows with a specific value for each of the first N-1 columns.
For instance if I have a data frame with four columns (A,B,C,D) and five rows:
A B C D
1 2 3 4
9 9 9 9
1 2 9 5
4 3 2 1
1 2 3 8
I want to get all the rows with A==1 & B==2 & C==3, i.e.:
A B C D
1 2 3 4
1 2 3 8
But as said, the data frame can be composed of any amount of rows and columns (defined at runtime), and the values of the conditions may change.
I implemented this function (simplified):
getRows<-function(dataFrame, values) {
conditions=rep(TRUE, dim(dataFrame)[1])
for (k in 1:length(values)) {
conditions=conditions&(dataFrame[,k]==values[k])
}
return(dataFrame[conditions,])
}
Of course, this assumes the values in the values vector are sorted with respect to the columns order of the data frame, and that the length of the vector is N-1.
The function works, but I have the feeling that it is not really efficient to create the boolean vector and evaluate the boolean expressions in this way, especially if the data frame contains a lot of data.
Another solution that I found is:
getRows<-function(dataFrame, values) {
tmp=dataFrame
for (k in 1:length(values)) {
tmp=tmp[tmp[,k]==values[k],]
}
return(tmp)
}
Basically this 'reduces' the data frame by filtering out all the rows that do not satisfy each condition. But I think this is even worse, because it creates a new data frame object for each condition (OK, always smaller, but anyway...).
So my question is: is there a method to do that more efficiently?
one possibility:
# if you are only checking for equalities
f <- function(df, values){
# values must be a list with the columns names of df as names and the conditions
# if you
y <- paste(names(values), unlist(values), sep="==", collapse=" & ")
return(df[eval(parse(text=y), envir=df),])
}
l <- as.vector(1:3, "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
# you can also use other conditions
f <- function(df, values){
# values must be a list with the columns names of df as names and the conditions
# if you
y <- paste(names(values), unlist(values), collapse=" & ")
return(df[eval(parse(text=y), envir=df),])
}
l <- as.vector(paste0(c("==", "<=", "=="), 1:3), "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
Sometimes matrices are quicker than data.frames to operate on, so something along the lines of:
mat <- t(as.matrix(df[-ncol(df)]))
boolMat <- (mat==values) # if necessary use match to reorder values to match columns of df
ind <- colSums(boolMat)==nrow(boolMat)
df[ind,]
The idea being that values will get recycled along the columns of the matrix (which are the rows of the dataframe). colSums is meant to be quicker than an apply, so the final line should be somewhat optimised compared to apply(boolMat, 2, all).
The optimal solution will depend on the size and proportions of the data, on whether the entries are all integers, and maybe on what proportion of matches you get in the data. So as @droopy mentions, you'll need to benchmark. My approach involves creating a copy of the data, so if your data is already approaching memory limits it might struggle, but maybe then you could generate your data in matrix rather than data.frame format to save the duplication.
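For comparison, here is a sketch of a third option that stays with the data.frame and avoids both eval(parse()) and the explicit loop. It is an illustration rather than a benchmarked recommendation: get_rows is a hypothetical name, it assumes values is non-empty and ordered like the leading columns, and it does not treat NA specially.

```r
df <- data.frame(A = c(1, 9, 1, 4, 1),
                 B = c(2, 9, 2, 3, 2),
                 C = c(3, 9, 9, 2, 3),
                 D = c(4, 9, 5, 1, 8))

# Hypothetical helper: keep rows whose first length(values) columns equal values
get_rows <- function(data, values) {
  cols <- data[seq_along(values)]               # leading columns as a list
  keep <- Reduce(`&`, Map(`==`, cols, values))  # one logical vector per column, AND-ed
  data[keep, , drop = FALSE]
}

get_rows(df, c(1, 2, 3))
```

Map compares each column with its target value and Reduce combines the resulting logical vectors with &, so only one subset of the data frame is created at the end.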

Append rle result from loop

I am running a coin-toss simulation with a loop which runs about 1 million times.
Each time I run the loop I wish to retain the table output from the RLE command. Unfortunately a simple append does not seem to be appropriate. Each time I run the loop I get a slightly different amount of data which seems to be one of the sticking points.
This code gives an idea of what I am doing:
N <- 5 #Number of times to run
rlex <-NULL
#begin loop#############################
for (i in 1:N) { #tells R to repeat N number
x <-sample(0:1, 100000, 1/2)
rlex <-append(rlex, rle(x))
}
table(rlex) #doesn't work
table(rle(x)) #only 1
So instead of having five separate rle results (in this simulation, 1 million in the full version), I want one merged rle table. Hope this is clear. Obviously my actual code is a bit more complex, hence any solution should be as close to what I have specified as possible.
UPDATE: The loop is an absolute requirement. No ifs or buts. Perhaps I can pull out the table(rle(x)) data and put it into a matrix. However again the stumbling block is the fact that some of the less frequent run lengths do not always turn up in each loop. Thus I guess I am looking to conditionally fill a matrix based on the run length number?
Last update before I give up: Retaining the rle$values will mean that too much data is being retained. My simulation is large-scale and I really only wish to retain the table output of the rle. Either I retain each table(rle(x)) for each loop and combine by hand (there will be thousands), or I find a programmatic way to keep the data (yes for zeroes and ones) and have one table that is formed from merging each of the individual loops as I go along.
Either this is easyish to do, as specified, or I will not be doing it. It may seem a silly idea/request, but that should be incidental to whether it can be done.
Seriously last time. Here is an animated gif showing what I expect to happen.
After each iteration of the loop data is added to the table. This is as clear as I am going to be able to communicate it.
OK, attempt number 4:
N <- 5
set.seed(1)
x <- NULL
for (i in 1:N){
x <- rbind(x, table(rle(sample(0:1, 100000, replace=TRUE))))
}
x <- as.data.frame(x)
x$length <- as.numeric(rownames(x))
aggregate(x[, 1:2], list(x[[3]]), sum)
Produces:
Group.1 0 1
1 1 62634 62531
2 2 31410 31577
3 3 15748 15488
4 4 7604 7876
5 5 3912 3845
6 6 1968 1951
7 7 979 971
8 8 498 477
9 9 227 246
10 10 109 128
11 11 65 59
12 12 24 30
13 13 21 11
14 14 7 10
15 15 0 4
16 16 4 2
17 17 0 1
18 18 0 1
If you want the aggregation inside the loop, do:
N <- 5
set.seed(1)
x <- NULL
for (i in 1:N){
x <- rbind(x, table(rle(sample(0:1, 100000, replace=TRUE))))
y <- aggregate(x, list(as.numeric(rownames(x))), sum)
print(y)
}
Following up @CarlWitthoft's answer, you probably want:
N <- 5
rlex <-NULL
for (i in 1:N) {
x <- sample(0:1, 100000, replace = TRUE)
rlex <-append(rlex, rle(x)$lengths)
}
since I think you don't care about the $values component (i.e. whether each run is a run of zeros or ones).
Result: one long vector of run lengths.
But this would probably be a lot more efficient:
maxlen <- 30
rlemat <- matrix(nrow=N,ncol=maxlen)
for (i in 1:N) {
x <- sample(0:1, 100000, replace = TRUE)
rlemat[i,] <- table(factor(rle(x)$lengths,levels=1:maxlen))
}
Result: an N by maxlen table of run lengths from each iteration.
If you only want to save the total number of runs of each length you could try:
rlecumsum <- rep(0,maxlen)
for (i in 1:N) {
x <- sample(0:1, 100000, replace = TRUE)
rlecumsum <- rlecumsum + table(factor(rle(x)$lengths,levels=1:maxlen))
}
Result: a vector of length maxlen of the total numbers of run lengths across all iterations.
And here's my final answer:
rlecumtab <- matrix(0,ncol=2,nrow=maxlen)
for (i in 1:N) {
x <- sample(0:1, 100000, replace = TRUE)
r1 <- rle(x)
rtab <- table(factor(r1$lengths,levels=1:maxlen),r1$values)
rlecumtab <- rlecumtab + rtab
}
Result: a maxlen by 2 table of the total numbers of run lengths across all iterations, divided by type (0-run vs 1-run).
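The accumulation can be traced by hand on a small deterministic input. In this sketch the two short series stand in for the random samples and are made up for illustration. It also wraps the values in a factor with fixed levels, an addition that guarantees both columns exist even if a series contains only zeros or only ones.

```r
maxlen <- 4
rlecumtab <- matrix(0, nrow = maxlen, ncol = 2,
                    dimnames = list(as.character(1:maxlen), c("0", "1")))

# Made-up stand-ins for the simulated coin tosses
series <- list(c(0, 0, 1, 1, 1, 0, 1), c(1, 1, 1, 1, 0))

for (x in series) {
  r <- rle(x)
  # Fixed levels on both factors give every table the same maxlen x 2 shape
  rtab <- table(factor(r$lengths, levels = 1:maxlen),
                factor(r$values, levels = 0:1))
  rlecumtab <- rlecumtab + as.vector(rtab)
}
rlecumtab
```

The first series contributes runs of length 2 and 1 for zeros and 3 and 1 for ones, the second a run of four ones and a single zero, so the running table ends with six runs in total.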
You need to read the help page for rle. Consider:
names(rlex) #"lengths" "values" "lengths" "values" .... and so on
In the meantime, I strongly suggest you spend some time reading up on statistical methods. There is zero (+/- epsilon) chance that running a binomial simulation a million times will tell you anything you won't learn after a few hundred tries, unless your coin has p=1e-5 :-).

Resources