I need to compute pairwise intersections of lists, close to 40k of them.
Specifically, I want to know if I can store a vector id in column 1 and the list of its values in column 2, and then process column 2, i.e. find the overlap/intersection between any two rows.
column 1 column 2
idA 1,2,5,9,10
idB 5,9,25
idC 2,25,67
I want to be able to get the pairwise intersection values, and this should also work when the values in column 2 are not already sorted.
What is the best data structure to use if I am going ahead with R?
My data originally looks like this:
column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0
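In other words, a minimal sketch of what I am after (assuming the data is held as a named list of vectors; vecs is just an illustrative name):
vecs <- list(idA = c(1, 2, 5, 9, 10),
             idB = c(5, 9, 25),
             idC = c(2, 25, 67))
intersect(vecs$idA, vecs$idB)  # 5 9 -- intersect() does not need sorted input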
I'd keep the data in a logical matrix:
DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0", header = TRUE, check.names = FALSE)
#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"
#if you can, build your data as a sparse matrix from the start to save memory
#(coercing an already dense matrix, as below, keeps it dense)
library(Matrix)
M <- as(m, "lMatrix")
#calculate intersections
#does each comparison twice
intersections <- simplify2array(
  lapply(seq_len(nrow(M)), function(x)
    lapply(seq_len(nrow(M)), function(y)
      colnames(M)[M[x, ] & M[y, ]]))
)
This double loop could be optimized. I'd do it in Rcpp and create a long-format data.frame instead of a list matrix. I'd also do each comparison only once (e.g., only the upper triangle; see the sketch at the end of this answer).
colnames(intersections) <- rownames(intersections) <- rownames(M)
# idA idB idC
#idA Character,5 Character,2 "2"
#idB Character,2 Character,3 "25"
#idC "2" "25" Character,3
intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
I need to create an adjacency matrix from a dataframe using tcrossprod, but the resulting matrix needs to obey a restriction that I will explain below. Consider the following dataframe:
z <- data.frame(Person = c("a","b","c","d"), Man_United = c(1,0,1,0))
z
Person Man_United
1 a 1
2 b 0
3 c 1
4 d 0
I make an adjacency matrix from z using tcrossprod.
x <- tcrossprod(table(z))
diag(x) <- 0
x
Person
Person a b c d
a 0 0 1 0
b 0 0 0 1
c 1 0 0 0
d 0 1 0 0
I need the resulting adjacency matrix to indicate a tie (here signaled with the number 1), only when both persons have value 1 in the original dataframe (i.e. are fans of Manchester United, in this example). For example, persons "a" and "c" of dataframe z are fans, so in the resulting adjacency matrix I want their intersecting cell to be valued 1. That works fine here. However, persons "b" and "d" are not fans, and the fact that both have value 0 in the original dataframe does not mean that they are connected in any meaningful way. tcrossprod, however, produces a matrix that suggests that they are in fact connected.
How can I use tcrossprod so that it captures only the positive values of the dataframe when producing adjacency matrices?
We may restrict attention to the column of ones in the table result with
tcrossprod(table(z)[, "1"])
#      [,1] [,2] [,3] [,4]
# [1,]    1    0    1    0
# [2,]    0    0    0    0
# [3,]    1    0    1    0
# [4,]    0    0    0    0
or, if you want to preserve the names,
tcrossprod(table(z)[, "1", drop = FALSE])
# Person
# Person a b c d
# a 1 0 1 0
# b 0 0 0 0
# c 1 0 1 0
# d 0 0 0 0
If there can be more nonzero values, then you may replace "1" by -1 so as to drop only the column for zeroes (which sorts first).
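For instance, a small sketch on hypothetical data z2 with values 0, 1 and 2, dropping the zero column by position:
z2 <- data.frame(Person = c("a","b","c","d"), Man_United = c(1, 0, 2, 1))
tcrossprod(table(z2)[, -1, drop = FALSE])
#         Person
# Person   a b c d
#        a 1 0 0 1
#        b 0 0 0 0
#        c 0 0 1 0
#        d 1 0 0 1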
I am trying to create an adjacency matrix M from a list s containing the indices that have to be equal to 1 in the matrix M.
For example, M below is a 5x4 matrix, and the list s contains 5 elements, one per row of M, each one a vector of indices:
s <- list("1210", c("254", "534"), "254", "534", "364")
M <- matrix(0, nrow = 5, ncol = length(unique(unlist(s))),
            dimnames = list(1:5, unique(unlist(s))))
My current solution is the naive brute-force one, a for loop over the rows of the matrix:
for (i in 1:nrow(M)){
M[i, as.character(s[[i]])] <- 1
}
The expected result is:
M
1210 254 534 364
1 1 0 0 0
2 0 1 1 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1
The problem is that I have to manipulate matrices with several thousand rows and it takes too much time. I am not an "apply" expert, but I wonder if there is a quicker solution.
We can convert the list to a two-column matrix of row/column indices, then use that index matrix to assign the corresponding elements of 'M' to 1.
M[as.matrix(stack(setNames(s, seq_along(s)))[,2:1])] <- 1
M
# 1210 254 534 364
#1 1 0 0 0
#2 0 1 1 0
#3 0 1 0 0
#4 0 0 1 0
#5 0 0 0 1
Or, instead of using stack to convert to a data.frame, we can unlist 's' to get the column index, cbind it with a row index built by replicating the sequence of the list according to the length of each list element (using lengths), and assign those elements of 'M' to 1.
M[cbind(rep(seq_along(s), lengths(s)), unlist(s))] <- 1
Or yet another option would be to create a sparseMatrix
library(Matrix)
Un1 <- unlist(s)
sparseMatrix(i = rep(seq_along(s), lengths(s)),
             j = as.integer(factor(Un1, levels = unique(Un1))),
             x = 1)
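If you want the sparse result labeled like M, sparseMatrix() also accepts a dimnames argument (a small, optional addition to the call above):
sparseMatrix(i = rep(seq_along(s), lengths(s)),
             j = as.integer(factor(Un1, levels = unique(Un1))),
             x = 1,
             dimnames = list(as.character(seq_along(s)), unique(Un1)))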
I have a data frame and I want to create a boolean data frame from it. I want every unique value of every column in the original data frame to become a column name in the boolean data frame. To show it using an example:
mydata =
sex route
m oral
f oral
m topical
f unknown
Then, I want to create
m f oral topical unknown
1 0 1 0 0
0 1 1 0 0
1 0 0 1 0
0 1 0 0 1
I am using the code below to create the boolean data frame. It works in R but not in Shiny. What could be the problem?
col_names <- c()
for (i in seq(1, ncol(mydata))) {
  col_names <- c(col_names, unique(mydata[i]))
}
col_names <- as.vector(unlist(col_names))
my_boolean <- data.frame(matrix(0, nrow = nrow(mydata), ncol = length(col_names)))
colnames(my_boolean) <- col_names
for (i in seq(1, nrow(mydata))) {
  for (j in seq(1, ncol(mydata))) {
    my_boolean[i, which(mydata[i, j] == colnames(my_boolean))] <- 1
  }
}
There are a few ways you can do this, but I always find table the easiest to understand. Here's an approach with table:
do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x)))
## f m oral topical unknown
## 1 0 1 1 0 0
## 2 1 0 1 0 0
## 3 0 1 0 1 0
## 4 1 0 0 0 1
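Another route, just a sketch and not necessarily faster (assuming character, not factor, columns; vals is a name introduced here), is to compare every unique value against the whole data frame:
vals <- unique(unlist(mydata))
sapply(vals, function(v) as.integer(rowSums(mydata == v) > 0))
##      m f oral topical unknown
## [1,] 1 0    1       0       0
## [2,] 0 1    1       0       0
## [3,] 1 0    0       1       0
## [4,] 0 1    0       0       1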
I posted a question yesterday and got a wonderful response from the experts. However, I am facing another problem now: the job cannot be done on my real data because my starting file (df1) is too large. I wonder if there is a faster method to do the same job without using adply or a for loop?
My original question is listed below:
Step 1: I have a simplified dataframe like this:
df1 = data.frame(B = c(1,0,1), C = c(1,1,0),
                 D = c(1,0,1), E = c(1,1,0), F = c(0,0,1),
                 G = c(0,1,0), H = c(0,0,1), I = c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3)
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
step 3: replace all -1 to 0
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Would you mind teaching me how to do this in a less memory-demanding way?
The fastest way I know to do step 2 is to use indices into df1 for the various pairwise comparisons you want to do. The combn() function can be used to generate the set of row-by-row comparisons required. (Using this will be the rate limiting step for big data sets.)
For the combinations of row-by-rows operations we want to form:
> cmb <- combn(as.numeric(rownames(df1)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
The rows of cmb hold the two sets of row indices into df1; the 3 columns correspond to the 3 rows of your requested output.
The next step is to use the two rows of cmb to index df1 and use a standard vectorised operation in R via -, e.g.:
> (out <- df1[cmb[1,], ] - df1[cmb[2,], ])
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
1.1 0 1 0 1 -1 0 -1 0
2 -1 1 -1 1 -1 1 -1 1
Step 3 can now be done, although I am assuming that there can only be 1, 0, and -1 values in the resulting output:
> out[out < 0] <- 0
> out
B C D E F G H I
1 1 0 1 0 0 0 0 0
1.1 0 1 0 1 0 0 0 0
2 0 1 0 1 0 1 0 1
Which is consistent with the output you requested.
For big operations, doing this with matrices might be faster overall. So we could do:
> mat <- data.matrix(df1)
> cmb <- combn(seq_len(NROW(mat)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> out2 <- mat[cmb[1,], ] - mat[cmb[2,], ]
> out2[out2 < 0] <- 0
> out2
B C D E F G H I
[1,] 1 0 1 0 0 0 0 0
[2,] 0 1 0 1 0 0 0 0
[3,] 0 1 0 1 0 1 0 1
If you need the rownames as you show, then you can easily generate these at the end:
> apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
[1] "row1-row2" "row1-row3" "row2-row3"
which can be used as:
> rownames(out) <- apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
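If you want to confirm the speed difference between the data-frame and matrix versions on your own data, a quick check along these lines (assuming the microbenchmark package is installed) would do:
library(microbenchmark)
microbenchmark(
  df  = df1[cmb[1, ], ] - df1[cmb[2, ], ],
  mat = mat[cmb[1, ], ] - mat[cmb[2, ], ]
)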
Using the sqldf package or RSQLite directly would allow one to do this with all computations done outside of R so that there would be no intermediate storage required. We illustrate using sqldf. See the sqldf home page for more info.
Alternative 1. In this approach, note that we use dbname = tempfile() so that it performs all computations in an external database (which it creates on the fly and automatically deletes) rather than doing it in memory.
library(sqldf)
gc()
DF <- sqldf("select x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from df1 x, df1 y
where x.rowid > y.rowid", dbname = tempfile())
This would only require that we are able to store df1 and DF in our workspace.
Alternative 2. If even that overflows we can write out df1, remove it, perform the calculation below and then we would only need sufficient storage to store the result, DF.
read.csv.sql uses dbname = tempfile() by default so in this case we do not need to specify it.
write.table(df1, "data.txt", sep = ",", quote = FALSE)
rm(df1)
gc()
DF <- read.csv.sql("data.txt", sql = "select
x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from file x, file y
where x.rowid > y.rowid")
(Of course, if it's really this large then you might have trouble doing any subsequent calculations on it too.)
Output. At any rate, both alternatives give the same result shown below. x and y show which input rows were subtracted.
> DF
x y B C D E F G H I
1 2 1 0 0 0 0 0 1 0 1
2 3 1 0 0 0 0 1 0 1 0
3 3 2 1 0 1 0 1 0 1 0
Note. Although the question asked about optimizing memory rather than speed, if speed were an issue one could add indexes; see the sketch below.
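For example, following the multi-statement pattern from the sqldf documentation, one could materialize an explicit id column and index it (df1x, idx and DF2 are names introduced here; SQLite's implicit rowid itself cannot be indexed). Only columns B and C are shown; the remaining columns follow the same pattern as above:
df1x <- cbind(id = seq_len(nrow(df1)), df1)
DF2 <- sqldf(c("create index idx on df1x(id)",
               "select x.id x, y.id y,
                       max(x.B - y.B, 0) B, max(x.C - y.C, 0) C
                  from main.df1x x, main.df1x y
                 where x.id > y.id"),
             dbname = tempfile())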
Since the data is homogeneous, use a matrix representation. Organize it so that the 'rows' are columns, as
m <- t(as.matrix(df1))
mode(m) <- "integer" # maybe already true?
pre-allocate the space for an answer
n <- ncol(m) - 1
ans <- matrix(0L, nrow(m), (n+1) * n / 2)
We want to compare column 1 to columns 1:n + 1L (the 1L treats the number one as an integer value, rather than real). This is m[,1] - m[, 1:n + 1L], using R's recycling. We iterate over columns, with idx and off helping to keep track of the columns we want to compare against and of where the result columns are placed in the answer:
off <- 0
for (i in 1:n) {
idx <- i:n + 1L
ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
off <- off + length(idx)
}
The final step is
ans[ans<0L] <- 0L
Maybe there are additional efficiencies from realizing that, under the original operation, the result is 0 unless m[,1] == 1 & m[, 1:n + 1L] == 0. Likewise, if space were a serious issue then the data might be represented as mode(m) <- "raw" and the arithmetic operations replaced with the comparison just suggested, along the lines of:
m <- t(as.matrix(df1))
mode(m) <- "raw"
n <- ncol(m) - 1
off <- 0
x0 <- as.raw(0); x1 <- as.raw(1)
ans <- matrix(as.raw(0), nrow(m), (n+1) * n / 2)  # matrix() needs non-empty data
for (i in 1:n) {
  idx <- i:n + 1L
  # a result of 1 occurs exactly where m[, i] is 1 and the compared column is 0
  updt <- which((m[, i] == x1) & (m[, idx] == x0))
  ans[off + updt] <- x1
  off <- off + length(idx) * nrow(ans)
}
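To view the raw result as ordinary 0/1 integers again, a final conversion along these lines could be used:
matrix(as.integer(ans), nrow = nrow(ans))  # raw 00/01 -> integer 0/1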