What are the least memory-demanding methods to do these steps?


I posted a question yesterday and got wonderful responses from the experts. However, I now face another problem: the job cannot be done on my real data because my starting file (df1) is too large. Is there a faster method to do the same job without using adply or a for loop?
My original question is listed below:
Step 1: I have a simplified dataframe like this:
df1 = data.frame (B=c(1,0,1), C=c(1,1,0)
, D=c(1,0,1), E=c(1,1,0), F=c(0,0,1)
, G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3)
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
Step 3: replace all -1 with 0
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Could you show me how to do this with a less memory-demanding approach?

The fastest way I know to do step 2 is to use indices into df1 for the various pairwise comparisons you want to do. The combn() function can be used to generate the set of row-by-row comparisons required. (Using this will be the rate limiting step for big data sets.)
For the combinations of row-by-rows operations we want to form:
> cmb <- combn(as.numeric(rownames(df1)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
The two rows of cmb hold the two sets of row indices into df1 needed to form the three rows of your requested output. (The 3 columns of cmb correspond to the 3 rows in your expected result.)
The next step is to use the two rows of cmb to index df1 and use a standard vectorised operation in R via -, e.g.:
> (out <- df1[cmb[1,], ] - df1[cmb[2,], ])
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
1.1 0 1 0 1 -1 0 -1 0
2 -1 1 -1 1 -1 1 -1 1
Step 3 can now be done, although I am assuming that there can only be 1, 0, and -1 values in the resulting output:
> out[out < 0] <- 0
> out
B C D E F G H I
1 1 0 1 0 0 0 0 0
1.1 0 1 0 1 0 0 0 0
2 0 1 0 1 0 1 0 1
Which is consistent with the output you requested.
For big operations, doing this with matrices might be faster overall. So we could do:
> mat <- data.matrix(df1)
> cmb <- combn(seq_len(NROW(mat)), 2)
> cmb
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> out2 <- mat[cmb[1,], ] - mat[cmb[2,], ]
> out2[out2 < 0] <- 0
> out2
B C D E F G H I
[1,] 1 0 1 0 0 0 0 0
[2,] 0 1 0 1 0 0 0 0
[3,] 0 1 0 1 0 1 0 1
If you need the rownames as you show, then you can easily generate these at the end:
> apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
[1] "row1-row2" "row1-row3" "row2-row3"
which can be used as:
> rownames(out) <- apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
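Before running this on real data, it may help to estimate how large the pairwise result will be, since the output holds one row per pair of input rows. The following is only a rough sketch: n_rows and n_cols are made-up illustrative values, not the asker's real dimensions, and the byte counts ignore R's per-object overhead.

```r
# Rough size of the pairwise result (illustrative values, not real data).
n_rows <- 10000   # assumed number of rows in df1
n_cols <- 8       # assumed number of columns in df1

n_pairs <- choose(n_rows, 2)            # n * (n - 1) / 2 row pairs
bytes_double  <- n_pairs * n_cols * 8   # stored as numeric (double)
bytes_integer <- n_pairs * n_cols * 4   # stored as integer

round(c(pairs = n_pairs,
        GB_double = bytes_double / 1e9,
        GB_integer = bytes_integer / 1e9), 2)
```

Because the pair count grows quadratically with the number of rows, storing the result as integer (or raw) rather than double can matter more than the choice of subtraction method.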

Using the sqldf package or RSQLite directly would allow one to do this with all computations done outside of R so that there would be no intermediate storage required. We illustrate using sqldf. See the sqldf home page for more info.
Alternative 1. In this approach, note that we use dbname = tempfile() so that all computations are performed in an external database (which sqldf creates on the fly and automatically deletes) rather than in memory.
library(sqldf)
gc()
DF <- sqldf("select x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from df1 x, df1 y
where x.rowid > y.rowid", dbname = tempfile())
This would only require that we are able to store df1 and DF in our workspace.
Alternative 2. If even that overflows we can write out df1, remove it, perform the calculation below and then we would only need sufficient storage to store the result, DF.
read.csv.sql uses dbname = tempfile() by default so in this case we do not need to specify it.
write.table(df1, "data.txt", sep = ",", quote = FALSE)
rm(df1)
gc()
DF <- read.csv.sql("data.txt", sql = "select
x.rowid x, y.rowid y,
max(x.B - y.B, 0) B, max(x.C - y.C, 0) C,
max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
max(x.F - y.F, 0) F, max(x.G - y.G, 0) G,
max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
from file x, file y
where x.rowid > y.rowid")
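(Of course, if it's really this large then you might have trouble doing any subsequent calculations on it too.)
Output. At any rate, both alternatives give the same result shown below. x and y show which input rows were subtracted. Note that with x.rowid > y.rowid each output row is the later row minus the earlier one (row2 - row1, and so on); change the condition to x.rowid < y.rowid to get row1 - row2 as written in the question.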
> DF
x y B C D E F G H I
1 2 1 0 0 0 0 0 1 0 1
2 3 1 0 0 0 0 1 0 1 0
3 3 2 1 0 1 0 1 0 1 0
Note. Although the question asked about optimizing memory rather than speed, if speed were an issue one could add indexes.
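With many columns, typing one max(...) term per column is impractical. As a hedged sketch, the select list could be generated from the column names instead. The cols vector below stands in for names(df1), and the sqldf call is left commented out so the snippet runs without the package installed.

```r
# Build the SQL statement from the column names instead of typing each term.
cols <- c("B", "C", "D", "E", "F", "G", "H", "I")   # in practice: names(df1)

# One "max(x.col - y.col, 0) col" term per column.
terms <- sprintf("max(x.%s - y.%s, 0) %s", cols, cols, cols)

sql <- paste0(
  "select x.rowid x, y.rowid y, ",
  paste(terms, collapse = ", "),
  " from df1 x, df1 y where x.rowid > y.rowid"
)
cat(sql, "\n")

# Then, as in Alternative 1:
# DF <- sqldf::sqldf(sql, dbname = tempfile())
```

This assumes the column names are plain identifiers that need no quoting inside the SQL statement.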

Since the data is homogeneous, use a matrix representation. Organize it so that the 'rows' are columns, as
m <- t(as.matrix(df1))
mode(m) <- "integer" # maybe already true?
Pre-allocate the space for the answer:
n <- ncol(m) - 1
ans <- matrix(0L, nrow(m), (n+1) * n / 2)
We want to compare column 1 to columns 1:n + 1L, that is, columns 2 through n + 1 (the 1L makes the number one an integer rather than a real value). This is m[, 1] - m[, 1:n + 1L], using R's recycling. We then iterate over columns, with idx tracking the columns to compare against and off tracking where the results are placed in the answer:
off <- 0
for (i in 1:n) {
idx <- i:n + 1L
ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
off <- off + length(idx)
}
The final step is
ans[ans<0L] <- 0L
Maybe there are additional efficiencies from realizing that the truth table under the original operation is 0 unless m[,1] == 1 & m[, 1:n + 1L] == 0. Likewise if space were a serious issue then the data might be represented as mode(m) <- "raw" and the arithmetic operations replaced with the comparison just suggested, along the lines of:
m <- t(as.matrix(df1))
mode(m) <- "raw"
n <- ncol(m) - 1 # as defined above
off <- 0
x0 <- as.raw(0); x1 <- as.raw(1)
ans <- matrix(raw(), nrow(m), (n+1) * n / 2)
for (i in 1:n) {
idx <- i:n + 1L
updt <- which((m[, i] == x1) & (m[, idx] == x0))
ans[off + updt] <- x1
off <- off + length(idx) * nrow(ans)
}
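The truth-table observation above can be checked on the example data: the clipped difference is 1 exactly when the first row holds 1 and the second holds 0. The following is a minimal sketch assuming the data contain only 0s and 1s, as in the question; it reuses the combn() indexing from the first answer rather than the column loop.

```r
df1 <- data.frame(B = c(1, 0, 1), C = c(1, 1, 0), D = c(1, 0, 1), E = c(1, 1, 0),
                  F = c(0, 0, 1), G = c(0, 1, 0), H = c(0, 0, 1), I = c(0, 1, 0))
mat <- data.matrix(df1)
cmb <- combn(seq_len(nrow(mat)), 2)

first  <- mat[cmb[1, ], , drop = FALSE]   # left-hand rows of each pair
second <- mat[cmb[2, ], , drop = FALSE]   # right-hand rows of each pair

# Logical test instead of subtract-then-clip; "+ 0L" turns TRUE/FALSE into 1/0.
by_logic <- (first == 1 & second == 0) + 0L

# Reference result: subtract, then replace negatives with 0.
by_subtraction <- pmax(first - second, 0)

all(by_logic == by_subtraction)
```

The logical form avoids ever creating the intermediate -1 values, and its result is an integer matrix rather than a double one.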

Related

find peaks and valleys for each column of a dataframe

I have a dataframe df, I would like to find peaks and valleys for each column and then replace the points where peaks and valleys are present with the value 1.
Here I made an example by applying it to only one column.
Is it possible to do this for all the columns in the dataframe?
df <- data.frame(a = sample(1:10,10),
b = sample(1:10,10),
c = sample(1:10,10),
d = sample(1:10,10),
e = sample(1:10,10))
vallys<- findValleys(df$b, thresh =0)
peaks <- findPeaks(df$b, thresh = 0)
df$b <- rep(0, nrow(df))
df$b <- replace(df$b, peaks, values=1)
df$b <- replace(df$b, vallys, values=1)
Thank you

The easiest thing is to put your code into a function.
library(quantmod)
replace_peaks_valleys <- function(x) {
valleys <- findValleys(x, thresh = 0)
peaks <- findPeaks(x, thresh = 0)
new_col <- rep(0, length(x))
new_col <- replace(new_col, peaks, values = 1)
new_col <- replace(new_col, valleys, values = 1)
return(new_col)
}
Then you can choose whether to do it in base R, dplyr or data.table.
base R
As you want to assign back to your original data frame, in base R you can do (note the square brackets or it will return a list):
df[] <- lapply(df, replace_peaks_valleys)
head(df)
# a b c d e
# 1 0 0 0 0 0
# 2 0 0 0 0 0
# 3 1 1 1 1 1
# 4 1 0 1 1 0
# 5 1 1 0 1 0
# 6 0 1 1 1 1
dplyr
Alternatively, with dplyr you can just do:
library(dplyr)
df |>
mutate(
across(
a:e, replace_peaks_valleys
)
)
# a b c d e
# 1 0 0 0 0 0
# 2 0 0 0 0 0
# 3 1 1 1 1 1
# 4 1 0 1 1 0
# <etc>
data.table
You can also do this with data.table:
library(data.table)
dt <- setDT(df)
dt[, lapply(.SD, replace_peaks_valleys)]
# a b c d e
# 1: 0 0 0 0 0
# 2: 0 0 0 0 0
# 3: 1 0 1 1 1
# 4: 1 1 0 0 0
# <etc>
N.B. I used set.seed(1) before I ran your code; if you do the same, you should get exactly the same output.
Function definition
I just copied and pasted your code and made it into a function. You could change it so you assign 0 or 1 to the existing vector, rather than creating a new vector every time:
replace_peaks_valleys2 <- function(x) {
valleys <- findValleys(x, thresh = 0)
peaks <- findPeaks(x, thresh = 0)
x[] <- 0
x[c(peaks,valleys)] <- 1
return(x)
}
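If quantmod is not available, the same column-wise pattern works with any function that returns a 0/1 vector. The sketch below uses a hypothetical base R helper, flag_extrema(), that marks interior points where the direction of change flips. Its indexing and its handling of ties and plateaus are assumptions and may not match what findPeaks() and findValleys() return.

```r
# Hypothetical helper: 1 at interior local maxima/minima, 0 elsewhere.
flag_extrema <- function(x) {
  turns <- diff(sign(diff(x)))          # non-zero where the slope changes sign
  flags <- numeric(length(x))
  flags[which(turns != 0) + 1] <- 1     # +1 maps back to positions in x
  flags
}

df <- data.frame(a = c(1, 3, 2, 5, 4),
                 b = c(5, 4, 3, 2, 1))

# Same pattern as above: df[] keeps the data frame structure.
df[] <- lapply(df, flag_extrema)
df
```

Column a alternates up and down, so its three interior points are flagged; column b is strictly decreasing, so nothing is flagged.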

Alternative ways to create a repetitive vector in R

I am tasked to create the vector
0 1 0 1 0 1 0 1 0 1
using two approaches without using c() or rep() in R.
I have tried a bunch of methods, but none of them seem to work.
Here are some of my attempts (all of which have failed) -
vector(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
a<-seq(from = 0, to = 1 , by = 1)
a
replicate(5, a)
b<-1*(0:1)
do.call(cbind, replicate(5, b, simplify=FALSE))
Any help on this would be appreciated! Thank you.

We can use bitwAnd
> bitwAnd(0:9, 1)
[1] 0 1 0 1 0 1 0 1 0 1
or kronecker
> kronecker(as.vector(matrix(1, 5)), 0:1)
[1] 0 1 0 1 0 1 0 1 0 1
> kronecker((1:5)^0, 0:1)
[1] 0 1 0 1 0 1 0 1 0 1
or outer
> as.vector(outer(0:1, (1:5)^0))
[1] 0 1 0 1 0 1 0 1 0 1

Solution 1: Generalized Function my_rep()
A generalized solution my_rep() for any vector x you wish repeated n times
my_rep <- function(x, n) {
return(
# Use modulo '%%' to subscript the original vector (whose length I'll call "m"), by
# cycling 'n' times through its indices.
x[0:(length(x) * n - 1) %% length(x) + 1]
# 1 2 ... m 1 2 ... m 1 2 ... m
# | 1st cycle | | 2nd cycle | ... | nth cycle |
)
}
which can solve this case
my_rep(x = 0:1, n = 5)
# [1] 0 1 0 1 0 1 0 1 0 1
and many others
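# Getting cute, to make a vector of strings without using 'c()'.
# (Note the two trailing spaces: strsplit() does not emit an empty string for a
# match at the very end, so a single trailing space would give only "a" "b".)
str_vec <- strsplit("a b  ", split = " ")[[1]]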
str_vec
# [1] "a" "b" ""
my_rep(x = str_vec, n = 3)
# [1] "a" "b" "" "a" "b" "" "a" "b" ""
Solution 2: Binary Vector of Arbitrary Length
Another quick solution, for a 0 1 0 1 ... 0 1 vector of arbitrary length l
# Whatever length you desire.
l <- 10
# Generate a vector of alternating 0s and 1s, of length 'l'.
(1:l - 1) %% 2
which yields the output:
[1] 0 1 0 1 0 1 0 1 0 1
Note
Special thanks to @Adam, who figured out 0:9 %% 2 on their own shortly after my comment with that same solution, and who graciously retracted their initial answer in favor of mine. :)

Exploiting boolean coercion.
+(1:10*c(-1, 1) > 0)
# [1] 0 1 0 1 0 1 0 1 0 1
Or without c().
+(1:10*(0:1*2) - 1 > 0)
# [1] 0 1 0 1 0 1 0 1 0 1

Here is a way using the apply functions.
unlist(lapply(1:5, function(x) 0:1))
# [1] 0 1 0 1 0 1 0 1 0 1
Similar but with replicate.
as.vector(replicate(5, 0:1))
# [1] 0 1 0 1 0 1 0 1 0 1
And just in case you love trig.
abs(as.integer(cos((1:10 * pi) / 2)))
# [1] 0 1 0 1 0 1 0 1 0 1
And here is one last one that I consider cheating just because. This one generalizes to any vector you want!
unlist(unname(read.table(textConnection("0 1 0 1 0 1 0 1 0 1"))))

We can use purrr::accumulate, and a simple negate(!) operation.
accumulate will perform the same operation recursively over its data argument and output all intermediate results.
In this case, it can be broken down into:
output[1] <-0
output[2] <-!output[1]
output[3] <-!output[2]
...
the output would then be c(0, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE), which is coerced to numeric.
purrr::accumulate(0:9, ~!.x)
[1] 0 1 0 1 0 1 0 1 0 1
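For readers without purrr, a similar running computation can be sketched in base R with Reduce() and accumulate = TRUE. This is an alternative illustration, not part of the answer above. It uses 1 - acc instead of ! so the values stay numeric throughout.

```r
# Start from 0 and flip the previous value nine times, keeping every
# intermediate result. The elements of 1:9 are ignored; they only set
# the number of steps.
alternating <- Reduce(function(acc, x) 1 - acc, 1:9,
                      init = 0, accumulate = TRUE)
alternating
```

Like the purrr version, this uses neither c() nor rep().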

First we make a list of the given numbers, then apply unlist() to the list to convert it into a vector, as shown in the code below:
my_list = list(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
v = unlist(my_list)
print(v)
[1] 0 1 0 1 0 1 0 1 0 1

R with/without dplyr: create new columns as combinations of prev. columns per row

Suppose I have a data frame like this:
A B C
1 0 1
0 1 1
1 0 0
I want to produce the following derived data frame using dplyr (or another library):
A B C AB AC BC
1 0 1 0 1 0
0 1 1 0 0 1
1 0 0 0 0 0
So I want to automatically create new columns in the data frame whose values are the products of the initial columns (in this case 3 products for each row: A*B, A*C and B*C). The point is to do this automatically (I have 6 columns, so I can't code all the combinations by hand). The automatically created columns should follow some naming scheme, since I will need to filter them later.

We can use combn to get the column combination and then use a for loop to create new columns.
# Create example data frame
dat <- read.table(text = "A B C
1 0 1
0 1 1
1 0 0",
header = TRUE)
# Create the column name combination
m <- combn(names(dat), m = 2)
# Create new columns
for (i in 1:ncol(m)){
dat[paste(m[, i], collapse = "")] <- dat[m[1, i]] * dat[m[2, i]]
}
dat
# A B C AB AC BC
# 1 1 0 1 0 1 0
# 2 0 1 1 0 0 1
# 3 1 0 0 0 0 0
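The loop can also be replaced by a single data frame multiplication, because indexing a data frame by a vector of column names returns those columns in order. This is a sketch of that variant on the same example data; dat2, pairs and prods are names introduced here for illustration.

```r
dat <- data.frame(A = c(1, 0, 1), B = c(0, 1, 0), C = c(1, 1, 0))

pairs <- combn(names(dat), 2)             # 2 x 3 matrix of column-name pairs

# Multiply the "first" columns by the "second" columns element-wise.
prods <- dat[pairs[1, ]] * dat[pairs[2, ]]
names(prods) <- paste0(pairs[1, ], pairs[2, ])

dat2 <- cbind(dat, prods)
dat2
```

This builds all product columns at once, which costs a temporary copy of the selected columns but avoids growing dat one column at a time.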

Sometimes it's best to code without thinking too hard:
df <- data.frame(A = c(1, 0, 1),
B = c(0, 1, 0),
C = c(1, 1, 0))
J <- K <- seq_along(df)
J_n <- K_n <- names(df)
for (j in J) {
for (k in K) {
if (j < k) {
j_name <- J_n[j]
k_name <- K_n[k]
df[[paste0(j_name, k_name)]] <- df[[j]] * df[[k]]
}
}
}
This assumes that the new names are not present in the original data frame. So if your original data frame contained columns A, B, and AB, this won't work.
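One way to guard against that name clash, and to make the new columns easy to filter later, is to build the names with a distinctive separator and check them before assigning. This is a sketch under assumptions: the "_x_" separator is an arbitrary choice made here, not something from the question.

```r
df <- data.frame(A = c(1, 0, 1), B = c(0, 1, 0), C = c(1, 1, 0))

pairs <- combn(names(df), 2)
new_names <- paste(pairs[1, ], pairs[2, ], sep = "_x_")   # e.g. "A_x_B"

# Stop early if any generated name already exists in the data frame.
stopifnot(!any(new_names %in% names(df)))

for (i in seq_len(ncol(pairs))) {
  df[[new_names[i]]] <- df[[pairs[1, i]]] * df[[pairs[2, i]]]
}

# The separator makes the product columns easy to pick out afterwards.
product_cols <- grep("_x_", names(df), fixed = TRUE, value = TRUE)
product_cols
```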

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above snippets run, but neither actually removes the all-zero columns.
I am not sure why that is, any tips are greatly appreciated

The OP found a solution by exchanging comments with me, but I want to add the following. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things: first subset each data frame in my.list, then return each data frame. I think he assumed that each data frame was updated by the subsetting. But the subset was never assigned to anything, and the trailing x simply returns each data frame unchanged, so R showed the result of the second command. (On the surface, he did not see any change.) Following his approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return that object. Alternatively, do the following.
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, this runs a logical check on each column: if colSums(x) != 0 is TRUE, the column is kept; otherwise it is removed. I hope this helps future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
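The question mentions that the real data frames start with non-numeric columns, on which colSums() would fail. Below is a sketch of one way to handle that, using a hypothetical helper drop_zero_cols() and small made-up data frames. It keeps every non-numeric column and uses drop = FALSE so a single surviving column still comes back as a data frame.

```r
# Hypothetical helper: drop numeric columns whose total is 0, keep the rest.
drop_zero_cols <- function(x) {
  all_zero <- vapply(
    x,
    function(col) is.numeric(col) && sum(col, na.rm = TRUE) == 0,
    logical(1)
  )
  x[, !all_zero, drop = FALSE]
}

# Made-up example data with a non-numeric id column.
my.list <- list(
  data.frame(id = c("p", "q"), a = c(0, 1), b = c(0, 0)),
  data.frame(id = c("r", "s"), g = c(0, 0), h = c(2, 0))
)

list2 <- lapply(my.list, drop_zero_cols)
lapply(list2, names)
```

Like the original, this tests the column total, so a column whose positive and negative values cancel out would also be dropped; testing any(col != 0) instead would avoid that.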

Join and sum not compatible matrices

My goal is to "sum" two non-conformable matrices (matrices with different dimensions) using (and preserving) row and column names.
I came up with this approach: convert the matrices to data.table objects, join them, and then sum the column vectors.
An example:
> M1
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
This is my code:
M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6)
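colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)
# to data.table objects (keep.rownames = TRUE needs row names, which become column "rn")
library(data.table)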
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
A <- DT2[DT1, roll = TRUE]
A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}
That outputs:
rn X1 X3 X4 X5 X7 X8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
I can then convert this data.table back to a matrix and fix the row and column names.
My questions are:
How can this procedure be generalized?
I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1) because I want to apply this function to matrices whose dimensions (and row/column names) are not known in advance.
In summary, I need a merge procedure that behaves as described.
Are there other strategies or implementations that achieve the same goal and are both faster and more general? (I'm hoping some data.table monster can help me.)
What kind of join (inner, outer, etc.) does this procedure correspond to?
Thanks in advance.
p.s.: I'm using data.table version 1.8.2
EDIT - SOLUTIONS
@Aaron's solution. No external libraries, only base R. It also works on more than two matrices.
add_matrices_1 <- function(...) {
a <- list(...)
cols <- sort(unique(unlist(lapply(a, colnames))))
rows <- sort(unique(unlist(lapply(a, rownames))))
out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows,cols))
for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
out
}
@MadScone's solution. Uses the reshape2 package. It works on only two matrices per call.
library(reshape2)
add_matrices_2 <- function(m1, m2) {
# melt both arguments to long (row, col, value) form, sum per cell, and cast back;
# acast() carries the row and column labels into the resulting matrix
acast(rbind(melt(m1), melt(m2)), Var1~Var2, fun.aggregate = sum)
}
@Aaron's solution. Uses the Matrix package. It works only on sparse matrices, and also on more than two of them.
add_matrices_3 <- function(...) {
a <- list(...)
cols <- sort(unique(unlist(lapply(a, colnames))))
rows <- sort(unique(unlist(lapply(a, rownames))))
nrows <- length(rows)
ncols <- length(cols)
newms <- lapply(a, function(m) {
s <- summary(m)
i <- match(rownames(m), rows)[s$i]
j <- match(colnames(m), cols)[s$j]
ilj <- i < j
sparseMatrix(
i = ifelse(ilj, i, j),
j = ifelse(ilj, j, i),
x = s$x,
dims = c(nrows, ncols),
dimnames = list(rows, cols),
symmetric = TRUE
)
})
Reduce(`+`, newms)
}
BENCHMARK (100 runs with microbenchmark package)
Unit: microseconds
expr min lq median uq max
1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448
No need to comment on the benchmark: @Aaron's solution wins.
Details
For insights about performance (which depends on the size and sparsity of the matrices), see @Aaron's edit (and the solution for sparse matrices: add_matrices_3).

I'd just line up the names and go to town with base R.
Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.
add_matrices_1 <- function(...) {
a <- list(...)
cols <- sort(unique(unlist(lapply(a, colnames))))
rows <- sort(unique(unlist(lapply(a, rownames))))
out <- array(0, dim=c(length(rows), length(cols)), dimnames=list(rows,cols))
for(M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
out
}
It then works like this:
# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)
add_matrices_1(M1, M2)
# 1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0
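To illustrate that the function is not limited to two matrices or to nested name sets, here is a small sketch with three made-up matrices whose row and column names only partly overlap. The function body is repeated from above so the snippet runs on its own; A, B, C and total are names introduced for this example.

```r
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)),
               dimnames = list(rows, cols))
  # Add each matrix into the block of 'out' addressed by its own names.
  for (M in a) {
    out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M
  }
  out
}

# Made-up inputs with partly overlapping names.
A <- matrix(1:4, 2, dimnames = list(c("a", "b"), c("a", "b")))
B <- matrix(10, 2, 2, dimnames = list(c("b", "c"), c("b", "c")))
C <- matrix(100, 1, 1, dimnames = list("d", "a"))

total <- add_matrices_1(A, B, C)
total
```

Cells named in only one input keep that input's value, cells named in several are summed, and cells named in none stay 0.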
For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)
makeM <- function(N, n, k) {
s1 <- sample(N, n)
M1 <- array(0, dim=c(n,n), dimnames=list(s1, s1))
r1 <- sample(n,k, replace=TRUE)
c1 <- sample(n,k, replace=TRUE)
M1[cbind(c(r1,c1), c(c1,r1))] <- sample(N,k)
M1
}
Then here's another version that uses sparse matrices.
library(Matrix) # provides sparseMatrix() and Matrix()
add_matrices_3 <- function(...) {
a <- list(...)
cols <- sort(unique(unlist(lapply(a, colnames))))
rows <- sort(unique(unlist(lapply(a, rownames))))
nrows <- length(rows)
ncols <- length(cols)
newms <- lapply(a, function(m) {
s <- summary(m)
i <- match(rownames(m), rows)[s$i]
j <- match(colnames(m), cols)[s$j]
ilj <- i<j
sparseMatrix(i=ifelse(ilj, i, j),
j=ifelse(ilj, j, i),
x=s$x,
dims=c(nrows, ncols),
dimnames=list(rows, cols), symmetric=TRUE)
})
Reduce(`+`, newms)
}
This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)
set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
# user system elapsed
# 2.987 0.841 4.133
system.time(add_matrices_3(mm1, mm2))
# user system elapsed
# 0.042 0.012 0.504
But when the matrices are small, my first solution is still faster.
set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
library(microbenchmark)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
# expr min lq median uq max
# 1 add_matrices_1(M1, M2) 398.495 406.543 423.825 544.0905 43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24
Moral of the story: Size and sparsity matter.
Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and not worry about speed unless you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.

Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) then assign the remaining column by reference.
# a function to quickly get the non key columns
nonkey <- function(DT){ setdiff(names(DT),key(DT))}
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))
#calculate; .. means "up one level"
result <- DT2[DT1, .SD + .SD, roll= TRUE][,notinR := unclass(DT1[, ..notinR])]
# re set the column order to the original (DT1) order
setcolorder(result, names(DT1))
# voila!
result
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
I'm not convinced this is a particularly stable solution: I'm not sure it isn't fluking the answer because M2 is a subset of M1.
Edit: an ugly approach using eval
This is made harder because you have non-syntactic names (`1` etc.)
inBoth <- intersect(nonkey(DT1), nonkey(DT2))
backquote <- function(x){paste0('`', x, '`')}
bqBoth <- backquote(inBoth)
charexp <- sprintf('list(%s)',paste(c(paste0( bqBoth,'=', bqBoth, '+ i.',inBoth), backquote(notinR)), collapse = ','))
result2 <- DT2[DT1,eval(parse(text = charexp)), roll = TRUE]
setcolorder(result2, names(DT1))
# voila!
result2
rn 1 3 4 5 7 8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0

I think I managed to do it with this single disgusting line:
cast(aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum), X1 ~ X2)[,-1]
This makes use of the reshape package. Returned as a data frame so convert to matrix as necessary.
If you want it in the format you suggested in your example, try this:
"%ms%" <- function(m1, m2) {
m <- as.matrix(cast(aggregate(value ~ X1 + X2, rbind(melt(m1), melt(m2)), sum), X1 ~ X2)[,-1])
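# Caution: unique(a, b) treats b as `incomparables`, so mn is just colnames(m1);
# the labels are only right when m1 already carries every name, as in the example.
mn <- unique(colnames(m1), colnames(m2))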
rownames(m) <- mn
colnames(m) <- mn
return (m)
}
Then you can do:
M1 %ms% M2
EDIT:
EXPLANATION
I obviously should have included an explanation, sorry.
melt(M1)
Converts M1 from its original form into a format like this (row, col, value). E.g.
1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
Is converted to:
X1 X2 value
1 1 1 0
2 3 1 0
3 4 1 1
etc. Combining M1 and M2 with rbind lists every (row, col, value) triple from both matrices in one single data frame. Now this:
aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum)
Sums values where the row and column are the same. So it will sum (1, 1) across both matrices for example. And (3, 1) etc. It won't do anything that doesn't exist e.g. M2 doesn't have a 7th column/row.
Finally cast transforms the matrix so that it is written with the result of aggregate's first column as rows, its second column as columns. Effectively undoing the melt from earlier. The [,-1] is taking off an unnecessary column leftover from cast (I think there is probably a better way of doing that but I don't know how).
As I said, it's returned as a data frame so use as.matrix() on the result if that's what you wish.
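The same melt, aggregate and cast idea can be sketched in base R without the reshape package, using as.data.frame(as.table()) for the long form and xtabs() to sum and reshape in one step. This is an alternative illustration on small made-up matrices, not the code from the answer above. The column names Var1, Var2 and Freq are the defaults produced by as.data.frame() on a table.

```r
# Made-up inputs whose names only partly overlap.
M1 <- matrix(c(0, 1, 1, 0), 2, dimnames = list(c("1", "3"), c("1", "3")))
M2 <- matrix(c(0, 2, 2, 5), 2, dimnames = list(c("1", "7"), c("1", "7")))

# "Melt": one (Var1, Var2, Freq) row per cell of each matrix.
long <- rbind(as.data.frame(as.table(M1)),
              as.data.frame(as.table(M2)))

# Sum cells that share a (row, column) label and "cast" back to wide form.
res <- xtabs(Freq ~ Var1 + Var2, data = long)
res
```

The result is an xtabs table rather than a plain matrix, and combinations that appear in neither input are filled with 0.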
