Related
I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1#i, j = summary(m1)$j, x = m1#x)
position_matrix_from_m2 <- rbind(i = m2#i, j = summary(m2)$j, x = m2#x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2#x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2#x according to m1#x.
ind <- m1#x > 0 & m1#x < 0.05
m2#x[ind] <- 1
m1#x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1#x > 0 & m1#x < 0.05
m2b#x[ind] <- 1
m1b#x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
I have a data frame with n columns and want to apply a function to each combination of columns. This is very similar to how the cor() function takes a data frame as input and produces a correlation matrix as output, for example:
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
cor(X)
Which will generate this output:
> cor(X)
A B C
A 1.00000000 -0.01199511 0.02337429
B -0.01199511 1.00000000 0.07918920
C 0.02337429 0.07918920 1.00000000
However, I have a custom function that I need to apply to each combination of columns. I am now using a solution that uses nested for loops, which works:
f <- function(x, y) sum((x+y)^2) # some placeholder function
out <- matrix(NA, ncol = ncol(X), nrow = ncol(X)) # pre-allocate
for(i in seq_along(X)) {
for(j in seq_along(X)) {
out[i, j] <- f(X[, i], X[, j]) # apply f() to each combination
}
}
Which produces:
> out
[,1] [,2] [,3]
[1,] 422.4447 207.0833 211.4198
[2,] 207.0833 409.1242 218.2430
[3,] 211.4198 218.2430 397.5321
I am currently trying to transition into the tidyverse and would prefer to avoid using for loops. Could someone show me a tidy solution for this situation? Thanks!
You could do
library(tidyverse)
f <- function(x, y) sum((x+y)^2)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
as.list(X) %>%
expand.grid(., .) %>%
mutate(out = map2_dbl(Var1, Var2, f)) %>%
as_tibble()
This isn’t a tidyverse solution, but it does avoid using for loops. We use RcppAlgos (I am the author) to generate all pair-wise permutations of columns and apply your custom function to each of these. After that, we coerce to a matrix.
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
library(RcppAlgos)
matrix(permuteGeneral(ncol(X), 2, repetition = TRUE, FUN = function(y) {
sum((X[,y[1]] + X[,y[2]])^2)
}), ncol = ncol(X))
# [,1] [,2] [,3]
# [1,] 429.8549 194.4271 179.4449
# [2,] 194.4271 326.8032 197.2585
# [3,] 179.4449 197.2585 409.6313
Using base R you could do:
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
OUT = diag(colSums((X+X)^2))
OUT[lower.tri(OUT)] = combn(X, 2, function(x) sum(do.call('+', x)^2)) #combn(X,2,function(x)sum(rowSums(x)^2))
OUT[upper.tri(OUT)] = OUT[lower.tri(OUT)]
OUT
[,1] [,2] [,3]
[1,] 429.8549 194.4271 179.4449
[2,] 194.4271 326.8032 197.2585
[3,] 179.4449 197.2585 409.6313
Does anyone know a fast way to create a matrix
like the following one in R.
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 2 2 2
[3,] 1 2 3 3
[4,] 1 2 3 4
The matrix above is 4x4 and I want to create something like 10000x10000.
You can do:
N <- 4
m <- matrix(nrow = N, ncol = N)
m[] <- pmin.int(col(m), row(m))
or a shorter version as suggested by #dickoa:
m <- outer(1:N, 1:N, pmin.int)
These also work and are both faster:
m <- pmin.int(matrix(1:N, nrow = N, byrow = TRUE),
matrix(1:N, nrow = N, byrow = FALSE))
m <- matrix(pmin.int(rep(1:N, each = N), 1:N), nrow = N)
Finally, here is a cute one using a matrix product but it is rather slow:
x <- matrix(1, N, N)
m <- lower.tri(x, diag = TRUE) %*% upper.tri(x, diag = TRUE)
Note that a 10k-by-10k matrix for R seems big, I hope you don't run out of memory.
I am trying to get a function that is the opposite of diff()
I want to add the values of adjacent columns in a matrix for each column in the matrix.
I do NOT need the sum of the entire column or row.
For example:
If I had:
[ 1 2 4;
3 5 8 ]
I would end up with:
[ 3 6;
8 13 ]
Of course for just one or two columns this is simple as I can just do x[,1]+x[,2], but these matrices are quite large.
I'm surprised that I cannot seem to find an efficient way to do this.
m <- matrix(c(1,3,2,5,4,8), nrow=2)
m[,-1] + m[,-ncol(m)]
[,1] [,2]
[1,] 3 6
[2,] 8 13
Or, just for the fun of it:
n <- ncol(m)
x <- suppressWarnings(matrix(c(1, 1, rep(0, n-1)),
nrow = n, ncol = n-1))
m %*% x
[,1] [,2]
[1,] 3 6
[2,] 8 13
Dummy data
mat <- matrix(sample(0:9, 100, replace = TRUE), nrow = 10)
Solution:
sum.mat <- lapply(1:(ncol(mat)-1), function(i) mat[,i] + mat[,i+1])
sum.mat <- matrix(unlist(sum.mat), byrow = FALSE, nrow = nrow(mat))
You could use:
m <- matrix(c(1,2,4,3,5,8), nrow=2, byrow=T)
sapply(2:ncol(m), function(x) m[,x] + m[,(x-1)])
I have 2 matrices.
The first one:
[1,2,3]
and the second one:
[3,1,2
2,1,3
3,2,1]
I'm looking for a way to multiply them.
The result is supposed to be: [11, 13, 10]
In R, mat1%*%mat2 don't work.
You need the transpose of the second matrix to get the result you wanted:
> v1 <- c(1,2,3)
> v2 <- matrix(c(3,1,2,2,1,3,3,2,1), ncol = 3, byrow = TRUE)
> v1 %*% t(v2)
[,1] [,2] [,3]
[1,] 11 13 10
Or potentially quicker (see ?crossprod) if the real problem is larger:
> tcrossprod(v1, v2)
[,1] [,2] [,3]
[1,] 11 13 10
mat1%%mat2 Actuall y works , this gives [ 16 9 11 ]
but you want mat1 %% t(mat2). This means transpose of second matrix, then u can get [11 13 10 ]
Rcode:
mat1 = matrix(c(1,2,3),nrow=1,ncol=3,byrow=TRUE)
mat2 = matrix(c(3,1,2,2,1,3,3,2,1), nrow=3,ncol=3,byrow=TRUE)
print(mat1)
print(mat2 )
#matrix Multiplication
print(mat1 %*% mat2 )
# matrix multiply with second matrix with transpose
# Note of using function t()
print(mat1 %*% t(mat2 ))
It's difficult to say what the best answer here is because the notation in the question isn't in R, it's in matlab. It's hard to tell if the questioner wants to multiple a vector, 1 row matrix, or 1 column matrix given the mixed notation.
An alternate answer to this question is simply switch the order of the multiplication.
v1 <- c(1,2,3)
v2 <- matrix(c(3,1,2,2,1,3,3,2,1), ncol = 3, byrow = TRUE)
v2 %*% v1
This yields an answer that's a single column rather than a single row matrix.
try this one
x<-c()
y<-c()
for(i in 1:9)
{
x[i]<-as.integer(readline("Enter number for 1st matrix"))
}
for(i in 1:9)
{
y[i]<-as.integer(readline("Enter number for 2nd matrix"))
}
M1 <- matrix(x, nrow=3,ncol = 3, byrow=TRUE)
M2 <- matrix(y, nrow=3,ncol = 3, byrow=TRUE)
print(M1%*%M2)