Join and sum incompatible matrices in R

My goal is to "sum" two incompatible matrices (matrices with different dimensions), using (and preserving) row and column names.
I've come up with this approach: convert the matrices to data.table objects, join them, and then sum the column vectors.
An example:
> M1
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
This is my code:
M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6)
rownames(M1) <- colnames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5)
rownames(M2) <- colnames(M2) <- c(1,3,4,5,8)
# to data.table objects
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
  A <- DT2[DT1, roll = TRUE]
  A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}
That outputs:
rn X1 X3 X4 X5 X7 X8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
Then I can convert this data.table back to a matrix and fix the row and column names.
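For completeness, a minimal sketch of that back-conversion, assuming the summed table above is saved as S (the name is just for illustration):
M <- as.matrix(S[, setdiff(names(S), "rn"), with = FALSE])  # drop the key column
rownames(M) <- S$rn
colnames(M) <- sub("^X", "", colnames(M))  # strip the X prefix added to the column names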
The questions are:
How can I generalize this procedure? I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), because I want to apply this function to matrices whose dimensions (and row/column names) are not known in advance. In summary, I need a merge procedure that behaves as described.
Are there other strategies/implementations that achieve the same goal while being faster and more general? (Hoping that some data.table monster helps me.)
What kind of join (inner, outer, etc.) does this procedure correspond to?
Thanks in advance.
P.S.: I'm using data.table version 1.8.2
EDIT - SOLUTIONS
@Aaron's solution. No external libraries, only base R. It also works on more than two matrices.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))
  for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
  out
}
@MadScone's solution. Uses the reshape2 package. Works on only two matrices per call.
add_matrices_2 <- function(m1, m2) {
  m <- acast(rbind(melt(m1), melt(m2)), Var1 ~ Var2, fun.aggregate = sum)
  mn <- union(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  m
}
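Called on the example matrices, it should reproduce the summed matrix from the question:
add_matrices_2(M1, M2)
#   1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0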
@Aaron's solution. Uses the Matrix package. Works only on sparse matrices (and also on more than two of them).
add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(
      i = ifelse(ilj, i, j),
      j = ifelse(ilj, j, i),
      x = s$x,
      dims = c(nrows, ncols),
      dimnames = list(rows, cols),
      symmetric = TRUE
    )
  })
  Reduce(`+`, newms)
}
BENCHMARK (100 runs with the microbenchmark package)
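The table below should be reproducible with a call along these lines (the exact arguments are an assumption):
library(microbenchmark)
microbenchmark(add_matrices_1(M1, M2), add_matrices_2(M1, M2), times = 100)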
Unit: microseconds
expr min lq median uq max
1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448
No need to comment on the benchmark: @Aaron's solution wins.
Details
For insights into performance (which depends on the size and sparsity of the matrices), see @Aaron's edit (and the solution for sparse matrices: add_matrices_3).

I'd just line up the names and go to town with base R.
Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))
  for (M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}
It then works like this:
# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)
add_matrices_1(M1, M2)
# 1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0
For bigger matrices, however, it doesn't do as well. Here's a function to make a test matrix, choosing n columns out of N possibilities and filling k spots with non-zero values. (This assumes symmetric matrices.)
makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim = c(n, n), dimnames = list(s1, s1))
  r1 <- sample(n, k, replace = TRUE)
  c1 <- sample(n, k, replace = TRUE)
  M1[cbind(c(r1, c1), c(c1, r1))] <- sample(N, k)
  M1
}
Then here's another version that uses sparse matrices.
add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(i = ifelse(ilj, i, j),
                 j = ifelse(ilj, j, i),
                 x = s$x,
                 dims = c(nrows, ncols),
                 dimnames = list(rows, cols), symmetric = TRUE)
  })
  Reduce(`+`, newms)
}
This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)
set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
# user system elapsed
# 2.987 0.841 4.133
system.time(add_matrices_3(mm1, mm2))
# user system elapsed
# 0.042 0.012 0.504
But when the matrices are small, my first solution is still faster.
set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
# expr min lq median uq max
# 1 add_matrices_1(M1, M2) 398.495 406.543 423.825 544.0905 43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24
Moral of the story: Size and sparsity matter.
Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and not worry about speed unless you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt; when that gets slow, I'd write a function like my second attempt.

Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) and then assign the remaining column by reference.
# a function to quickly get the non-key columns
nonkey <- function(DT) { setdiff(names(DT), key(DT)) }
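# e.g. with the example matrices, nonkey(DT1) should return
# c("1","3","4","5","7","8"), and notinR below should be "7"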
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))
# calculate; the ".." prefix means "up one level"
result <- DT2[DT1, .SD + .SD, roll = TRUE][, notinR := unclass(DT1[, ..notinR])]
# reset the column order to the original (DT1) order
setcolorder(result, names(DT1))
# voila!
result
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
I'm not convinced this is a particularly stable solution, given that I'm not sure it isn't fluking the answer because M1 and M2 are subsets of each other.
Edit: an ugly approach using eval
This is made harder because you have non-syntactic names (`1`, etc.).
inBoth <- intersect(nonkey(DT1), nonkey(DT2))
backquote <- function(x) { paste0('`', x, '`') }
bqBoth <- backquote(inBoth)
charexp <- sprintf('list(%s)', paste(c(paste0(bqBoth, '=', bqBoth, '+ i.', inBoth), backquote(notinR)), collapse = ','))
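# for the example matrices, charexp should end up looking like:
# "list(`1`=`1`+ i.1,`3`=`3`+ i.3,`4`=`4`+ i.4,`5`=`5`+ i.5,`8`=`8`+ i.8,`7`)"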
result2 <- DT2[DT1, eval(parse(text = charexp)), roll = TRUE]
setcolorder(result2, names(DT1))
# voila!
result2
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0

I think I managed to do it with this single disgusting line:
cast(aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum), X1 ~ X2)[,-1]
This makes use of the reshape package. The result is returned as a data frame, so convert it to a matrix as necessary.
If you want it in the format you suggested in your example, try this:
"%ms%" <- function(m1, m2) {
  m <- as.matrix(cast(aggregate(value ~ X1 + X2, rbind(melt(m1), melt(m2)), sum), X1 ~ X2)[, -1])
  mn <- union(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  return(m)
}
Then you can do:
M1 %ms% M2
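which should reproduce the summed matrix from the question:
#   1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0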
EDIT:
EXPLANATION
Obviously this should have had some explanation, sorry.
melt(M1)
Converts M1 from its original form into a long format of (row, col, value) triplets. E.g.
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
Is converted to:
X1 X2 value
1 1 1 0
2 3 1 0
3 4 1 1
etc. Combining the melted M1 and M2 lists every possible (row, col, value) triplet across both matrices in one single data frame. Now this:
aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum)
Sums values where the row and column are the same. So it will sum the (1, 1) entries across both matrices, for example, and the (3, 1) entries, etc. It won't sum anything that doesn't exist, e.g. M2 has no row/column named 7.
Finally, cast transforms the result so that aggregate's first column becomes the rows and its second column becomes the columns, effectively undoing the earlier melt. The [,-1] removes an unnecessary leftover column from cast (I think there is probably a better way of doing that, but I don't know how).
As I said, it's returned as a data frame so use as.matrix() on the result if that's what you wish.
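For what it's worth, reshape2's acast (as used in add_matrices_2 above) returns a matrix directly, which avoids both the [,-1] step and the final as.matrix() conversion (note that reshape2's melt names the id columns Var1/Var2 rather than X1/X2):
m <- acast(rbind(melt(m1), melt(m2)), Var1 ~ Var2, fun.aggregate = sum)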

Related

Add X number of columns to a data.frame

I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns of 0s based on the value of 'X', then move the new columns to the front:
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
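With X = 3, this should reproduce the desired output:
#   a b c x y
# 1 0 0 0 1 4
# 2 0 0 0 2 5
# 3 0 0 0 3 6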
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)), letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
  cbind.data.frame(matrix(
    0,
    ncol = n,
    nrow = nrow(x),
    dimnames = list(NULL, letters[1:n])
  ), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
You can try the code below:
dt <- cbind(`colnames<-`(t(rep(0, X)), letters[seq(X)]), dt)
If you don't care about the names of the added columns, you can use just
dt <- cbind(t(rep(0, X)), dt)
which is much shorter. (t(rep(0, X)) is a 1-row matrix, so cbind() recycles it to the number of rows of dt.)

rbind list of arbitrary number of dataframes

I have a list of dataframes with some overlapping columns in each. The number of dataframes in the list is unknown. How can I efficiently, in base, rbind the dataframes together and fill in non overlapping columns with zeros?
Example data:
x <- data.frame(a=1:2, b=1:2, c=1:2)
y <- data.frame(a=1:2, r=1:2, f=1:2)
z <- data.frame(b=1:3, c=1:3, v=1:3, t=c("A", "A", "D"))
L1 <- list(x, y, z)
Desired output:
a b c f r t v
1 1 1 1 0 0 0 0
2 2 2 2 0 0 0 0
3 1 0 0 1 1 0 0
4 2 0 0 2 2 0 0
5 0 1 1 0 0 A 1
6 0 2 2 0 0 A 2
7 0 3 3 0 0 D 3
Pad out each data frame with the missing columns, then rbind them:
allnames <- unique(unlist(lapply(L1, names)))
do.call(rbind, lapply(L1, function(df) {
  not <- allnames[!allnames %in% names(df)]
  df[, not] <- 0
  df
}))
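Row names aside, this should give something like the desired output, with the column order following the first data frame (the t column mixes the numeric fill 0 with factor values, so its coercion details may vary):
#   a b c r f v t
# 1 1 1 1 0 0 0 0
# 2 2 2 2 0 0 0 0
# 3 1 0 0 1 1 0 0
# 4 2 0 0 2 2 0 0
# 5 0 1 1 0 0 1 A
# 6 0 2 2 0 0 2 A
# 7 0 3 3 0 0 3 D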
I have an old (and probably inefficient) function that does this. I've made one modification here to allow the fill to be specified.
RBIND <- function(datalist, keep.rownames = TRUE, fill = NA) {
  Len <- sapply(datalist, ncol)
  if (all(diff(Len) == 0)) {
    temp <- names(datalist[[1]])
    if (all(sapply(datalist, function(x) names(x) %in% temp))) tryme <- "basic"
    else tryme <- "complex"
  }
  else tryme <- "complex"
  almost <- switch(
    tryme,
    basic = { do.call("rbind", datalist) },
    complex = {
      Names <- unique(unlist(lapply(datalist, names)))
      NROWS <- c(0, cumsum(sapply(datalist, nrow)))
      NROWS <- paste(NROWS[-length(NROWS)] + 1, NROWS[-1], sep = ":")
      out <- lapply(1:length(datalist), function(x) {
        emptyMat <- matrix(fill, nrow = nrow(datalist[[x]]), ncol = length(Names))
        colnames(emptyMat) <- Names
        emptyMat[, match(names(datalist[[x]]), colnames(emptyMat))] <- as.matrix(datalist[[x]])
        emptyMat
      })
      do.call("rbind", out)
    })
  Final <- as.data.frame(almost, row.names = 1:nrow(almost))
  Final <- data.frame(lapply(Final, function(x) type.convert(as.character(x))))
  if (isTRUE(keep.rownames)) {
    row.names(Final) <- make.unique(unlist(lapply(datalist, row.names)))
  }
  Final
}
Here it is on your sample data.
RBIND(L1, fill = 0)
# a b c r f v t
# 1 1 1 1 0 0 0 0
# 2 2 2 2 0 0 0 0
# 1.1 1 0 0 1 1 0 0
# 2.1 2 0 0 2 2 0 0
# 1.2 0 1 1 0 0 1 A
# 2.2 0 2 2 0 0 2 A
# 3 0 3 3 0 0 3 D

R: combine rows of a matrix by group

I am attempting to reformat the data set my.data to obtain the output shown below the my.data2 statement. Specifically, I want to put the last 4 columns of my.data on one line per record.id, where those four columns will occupy columns 2-5 of the new data matrix if group = 1 and columns 6-9 if group = 2.
I wrote the cumbersome code below, but the double for-loop is causing an error that I simply cannot locate. Even if the double for-loop worked, I suspect there is a much more efficient way of accomplishing the same thing (maybe reshape?).
Thank you for any help correcting the double for-loop or with more efficient code.
my.data <- "record.id group s1 s2 s3 s4
1 1 2 0 1 3
1 2 0 0 0 12
2 1 0 0 0 0
3 1 10 0 0 0
4 1 1 0 0 0
4 2 0 0 0 0
8 2 0 2 2 0
9 1 0 0 0 0
9 2 0 0 0 0"
my.data2 <- read.table(textConnection(my.data), header=T)
# desired output
#
# 1 2 0 1 3 0 0 0 12
# 2 0 0 0 0 0 0 0 0
# 3 10 0 0 0 0 0 0 0
# 4 1 0 0 0 0 0 0 0
# 8 0 0 0 0 0 2 2 0
# 9 0 0 0 0 0 0 0 0
Code:
dat_sorted <- sort(unique(my.data2[,1]))
my.seq <- match(my.data2[,1],dat_sorted)
my.data3 <- cbind(my.seq, my.data2)
group.min <- tapply(my.data3$group, my.data3$my.seq, min)
group.max <- tapply(my.data3$group, my.data3$my.seq, max)
# my.min <- group.min[my.data3[,1]]
# my.max <- group.max[my.data3[,1]]
my.records <- matrix(0, nrow=length(unique(my.data3$record.id)), ncol=9)
x <- 1
for(i in 1:max(my.data3$my.seq)) {
  for(j in group.min[i]:group.max[i]) {
    if(my.data3[x,1] == i) my.records[i,1] = i
    # the two lines below seem to be causing an error
    if((my.data3[x,1] == i) & (my.data3[x,3] == 1)) (my.records[i,2:5] = my.data3[x,4:7])
    if((my.data3[x,1] == i) & (my.data3[x,3] == 2)) (my.records[i,6:9] = my.data3[x,4:7])
    x <- x + 1
  }
}
You are right, reshape helps here.
library(reshape2)
m <- melt(my.data2, id.var = c("record.id", "group"))
dcast(m, record.id ~ group + variable, fill = 0)
record.id 1_s1 1_s2 1_s3 1_s4 2_s1 2_s2 2_s3 2_s4
1 1 2 0 1 3 0 0 0 12
2 2 0 0 0 0 0 0 0 0
3 3 10 0 0 0 0 0 0 0
4 4 1 0 0 0 0 0 0 0
5 8 0 0 0 0 0 2 2 0
6 9 0 0 0 0 0 0 0 0
Comparison:
dfTest <- data.frame(record.id = rep(1:10e5, each = 2), group = 1:2,
                     s1 = sample(1:10, 10e5 * 2, replace = TRUE),
                     s2 = sample(1:10, 10e5 * 2, replace = TRUE),
                     s3 = sample(1:10, 10e5 * 2, replace = TRUE),
                     s4 = sample(1:10, 10e5 * 2, replace = TRUE))
system.time({
...# Your code
})
Error in my.records[i, 1] = i : incorrect number of subscripts on matrix
Timing stopped at: 41.61 0.36 42.56
system.time({m <- melt(dfTest, id.var = c("record.id", "group"))
dcast(m, record.id ~ group + variable, fill = 0)})
user system elapsed
25.04 2.78 28.72
Julius' answer is better, but for completeness, I think I managed to get the following for-loop to work:
dat_x <- (unique(my.data2[,1]))
my.seq <- match(my.data2[,1],dat_x)
my.data3 <- as.data.frame(cbind(my.seq, my.data2))
my.records <- matrix(0, nrow=length(unique(my.data3$record.id)), ncol=9)
my.records <- as.data.frame(my.records)
my.records[,1] = unique(my.data3[,2])
for(i in 1:9) {
  if(my.data3[i,3] == 1) (my.records[my.data3[i,1], c(2:5)] = my.data3[i, c(4:7)])
  if(my.data3[i,3] == 2) (my.records[my.data3[i,1], c(6:9)] = my.data3[i, c(4:7)])
}

Loop to perform calculations across rows on specific columns matching a pattern (in data frame)?

I have a dataframe with some boolean values (1/0) as follows (sorry, I couldn't work out how to make this into a smart table):
Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0
I have 64 samples (Sam/Ted/..., etc.) whose names are in a vector called files, i.e.
files <- c("Sam", "Ted", "Ann", ....)
And I would like to create a column summing the flag values for each sample, to create the following:
Sam Ted
probe1.flagsum 1 1
probe2.flagsum 0 0
probe3.flagsum 1 0
probe4.flagsum 0 0
probe5.flagsum 2 1
I am fairly new to R, trying to learn on a need-to-know basis, but I have tried the following:
for(i in files) {
  FLAGS$i <- cbind(sapply(i, function(y) {
    # grepping columns to filter for one sample
    filter1 <- grep(names(filters), pattern = y)
    # print out the summed values for those columns
    FLAGS$y <- rowSums(filters[, (filter1)])
  }))
}
The above code does not work and I am a bit lost as to how to move forward.
Can anyone help me untangle this problem or point me in the right direction of the commands/tools to use?
Thank you.
This is easily doable in base R's reshape, though using the reshape or reshape2 packages might be more intuitive.
Here's a solution in base R:
# Here's your data in its current form
dat = read.table(header=TRUE, text="Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0")
# Generate an ID row
dat$id = row.names(dat)
# Reshape wide to long
r.dat = reshape(dat, direction = "long",
                timevar = "probe",
                varying = 1:6, sep = ".")
# Calculate row sums
r.dat$sum = rowSums(r.dat[3:5])
# Reshape back to wide format, dropping what you're not interested in
reshape(r.dat, direction = "wide",
        idvar = "id", timevar = "probe",
        drop = 3:5)
## id sum.Sam sum.Ted
## probe1.Sam probe1 1 1
## probe2.Sam probe2 0 0
## probe3.Sam probe3 1 0
## probe4.Sam probe4 0 0
## probe5.Sam probe5 2 1
More than one way to skin a cat
You can also whip up a function like this one:
myFun = function(data, varnames) {
  temp = vector("list", length(varnames))
  for (i in 1:length(varnames)) {
    temp[[i]] = colSums(t(data[grep(varnames[i], names(data))]))
    names(temp)[[i]] = varnames[i]
  }
  data.frame(temp)
}
Then, making use of the vector that you have of names:
files = c("Sam", "Ted")
myFun(dat, files)
## Sam Ted
## probe1 1 1
## probe2 0 0
## probe3 1 0
## probe4 0 0
## probe5 2 1
Enjoy!
If filters is your input matrix and FLAGS your desired output matrix then I would (naïvely) do something like this:
FLAGS <- matrix(0, nrow = nrow(filters), ncol = length(files))
for(i in 1:length(files)){
  index <- grep(files[i], colnames(filters))
  FLAGS[,i] <- rowSums(filters[, index])
}
colnames(FLAGS) <- files
assuming your matrix is called input
input <- matrix(rbinom(30, 1, 0.5), ncol = 6)
colnames(input) <- c("F1.S", "F2.S", "F3.S", "F1.T", "F2.T", "F3.T")
rownames(input) <- paste("probe", 1:5, sep = "")
input <- as.data.frame(input)
library(reshape)
input$probe <- rownames(input)
Molten <- melt(input, id.vars = "probe")
Molten$ID <- gsub("^.*\\.", "", levels(Molten$variable))[Molten$variable]
cast(probe ~ ID, data = Molten, fun = "sum")
Update with the dat data frame from mrdwab:
dat = read.table(header=TRUE, text="Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0")
library(reshape)
dat$probe <- rownames(dat)
Molten <- melt(dat, id.vars = "probe")
Molten$ID <- gsub("^.*\\.", "", levels(Molten$variable))[Molten$variable]
cast(probe ~ ID, data = Molten, fun = "sum")

What is the least memory-demanding method to do these steps?

I posted a question yesterday and got a wonderful response from the experts. However, I am facing another question now: the job cannot be done on my real data, as my starting file (df1) is too large. Is there a faster method to do the same job without using adply or a for loop?
My original question is listed below:
Step 1: I have a simplified dataframe like this:
df1 = data.frame(B = c(1,0,1), C = c(1,1,0),
                 D = c(1,0,1), E = c(1,1,0), F = c(0,0,1),
                 G = c(0,1,0), H = c(0,0,1), I = c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row-wise subtraction, i.e. (row1 - row2), (row1 - row3), and (row2 - row3):
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
Step 3: replace all -1 with 0:
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Would you mind teaching me how to do this in a less memory-demanding way?
The fastest way I know to do step 2 is to use indices into df1 for the various pairwise comparisons you want to do. The combn() function can be used to generate the set of row-by-row comparisons required. (This will be the rate-limiting step for big data sets.)
For the combinations of row-by-rows operations we want to form:
> cmb <- combn(as.numeric(rownames(df1)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
The rows of cmb represent the two sets of indices into df1 required to form the three rows of your requested output. (The three columns correspond to the 3 rows in your expected result.)
The next step is to use the two rows of cmb to index df1 and use a standard vectorised operation in R via -, e.g.:
> (out <- df1[cmb[1,], ] - df1[cmb[2,], ])
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
1.1 0 1 0 1 -1 0 -1 0
2 -1 1 -1 1 -1 1 -1 1
Step 3 can now be done, although I am assuming that there can only be 1, 0, and -1 values in the resulting output:
> out[out < 0] <- 0
> out
B C D E F G H I
1 1 0 1 0 0 0 0 0
1.1 0 1 0 1 0 0 0 0
2 0 1 0 1 0 1 0 1
Which is consistent with the output you requested.
For big operations, doing this with matrices might be faster overall. So we could do:
> mat <- data.matrix(df1)
> cmb <- combn(seq_len(NROW(mat)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> out2 <- mat[cmb[1,], ] - mat[cmb[2,], ]
> out2[out2 < 0] <- 0
> out2
B C D E F G H I
[1,] 1 0 1 0 0 0 0 0
[2,] 0 1 0 1 0 0 0 0
[3,] 0 1 0 1 0 1 0 1
If you need the rownames as you show, then you can easily generate these at the end:
> apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
[1] "row1-row2" "row1-row3" "row2-row3"
which can be used as:
> rownames(out) <- apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Using the sqldf package or RSQLite directly would allow one to do this with all computations done outside of R so that there would be no intermediate storage required. We illustrate using sqldf. See the sqldf home page for more info.
Alternative 1 In this approach note that we use dbname = tempfile() so that it performs all computations in an external database (which it creates on the fly and automatically deletes) rather than doing it in memory.
library(sqldf)
gc()
DF <- sqldf("select x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from df1 x, df1 y
where x.rowid > y.rowid", dbname = tempfile())
This would only require that we are able to store df1 and DF in our workspace.
Alternative 2. If even that overflows we can write out df1, remove it, perform the calculation below and then we would only need sufficient storage to store the result, DF.
read.csv.sql uses dbname = tempfile() by default so in this case we do not need to specify it.
write.table(df1, "data.txt", sep = ",", quote = FALSE)
rm(df1)
gc()
DF <- read.csv.sql("data.txt", sql = "select
x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from file x, file y
where x.rowid > y.rowid")
(Of course, if it's really this large then you might have trouble doing any subsequent calculations on it too.)
Output. At any rate, both alternatives give the same result shown below. x and y show which input rows were subtracted.
> DF
x y B C D E F G H I
1 2 1 0 0 0 0 0 1 0 1
2 3 1 0 0 0 0 1 0 1 0
3 3 2 1 0 1 0 1 0 1 0
Note: although the question asked about optimizing memory rather than speed, if speed were an issue one could add indexes.
Since the data is homogeneous, use a matrix representation. Organize it so that the 'rows' are columns, as
m <- t(as.matrix(df1))
mode(m) <- "integer" # maybe already true?
pre-allocate the space for an answer
n <- ncol(m) - 1
ans <- matrix(0L, nrow(m), (n+1) * n / 2)
We want to compare column 1 to columns 1:n + 1L (the 1L treats the number one as an integer value, rather than real). This is m[,1] - m[, 1:n + 1L], using R's recycling. Iterating over columns, with idx and off helping to keep track of the index of the columns we want to compare to, and the placement columns in the answer
off <- 0
for (i in 1:n) {
  idx <- i:n + 1L
  ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
  off <- off + length(idx)
}
The final step is
ans[ans<0L] <- 0L
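As a quick check on the example data, transposing ans recovers the requested layout (ans carries no dimnames, so the rows below correspond to row1-row2, row1-row3 and row2-row3, and the columns to B through I):
t(ans)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    1    0    1    0    0    0    0    0
# [2,]    0    1    0    1    0    0    0    0
# [3,]    0    1    0    1    0    1    0    1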
Maybe there are additional efficiencies from realizing that the truth table under the original operation is 0 unless m[,1] == 1 & m[, 1:n + 1L] == 0. Likewise, if space were a serious issue, the data might be represented as mode(m) <- "raw" and the arithmetic operations replaced with the comparison just suggested, along the lines of:
m <- t(as.matrix(df1))
mode(m) <- "raw"
off <- 0
x0 <- as.raw(0); x1 <- as.raw(1)
ans <- matrix(raw(), nrow(m), (n+1) * n / 2)
for (i in 1:n) {
  idx <- i:n + 1L
  updt <- which((m[, i] == x1) & (m[, idx] == x0))
  ans[off + updt] <- x1
  off <- off + length(idx) * nrow(ans)
}
