R: Efficient way to convert factor into binary matrix - r

I'd like to convert a factor of length n into an n×n binary matrix whose (i, j) element is 1 if the i-th and j-th elements of the factor are the same, and 0 otherwise.
The following is a naive implementation of what I want, but it is quite slow. Is there a more efficient way to do the same thing?
size <- 100
id <- factor(sample(3, size, replace=TRUE))
mat <- matrix(0, nrow=size, ncol=size)
for(i in 1:size){
  for(j in 1:size){
    if(id[i] == id[j]){
      mat[i, j] <- 1
    }
  }
}

Another alternative, which should be relatively fast:
tcrossprod(model.matrix( ~ id + 0))
Similarly to Hong Ooi's answer, you can also use sparse matrices:
library(Matrix)
tcrossprod(sparse.model.matrix( ~ id + 0))

outer can be used for this.
mat <- outer(id, id, "==")
Since the output is a binary matrix, and O(N^2) objects are kind of large, this is a good use case for sparse matrices:
library(Matrix)
mat <- Matrix(nrow=100, ncol=100)
mat[] <- outer(id, id, "==") # [] means to assign into the existing 'mat' matrix
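As a rough sanity check, the sketch below compares the loop from the question with the outer and tcrossprod approaches on the same input. The names mat_loop, mat_outer and mat_tcp are made up for this illustration, and the seed is arbitrary.

```r
set.seed(42)
size <- 100
id <- factor(sample(3, size, replace = TRUE))

# Reference result: the double loop from the question
mat_loop <- matrix(0, nrow = size, ncol = size)
for (i in seq_len(size)) {
  for (j in seq_len(size)) {
    if (id[i] == id[j]) mat_loop[i, j] <- 1
  }
}

# outer() returns a logical matrix; multiplying by 1 converts it to 0/1
mat_outer <- outer(id, id, "==") * 1

# One indicator column per level; the cross-product is 1 where levels match
mat_tcp <- unname(tcrossprod(model.matrix(~ id + 0)))

all.equal(mat_loop, mat_outer)
all.equal(mat_loop, mat_tcp)
```

Both vectorized versions avoid the n^2 scalar comparisons in interpreted R code, which is usually where the loop spends its time.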


R - How can I make this loop faster?

Is there some way to make this loop faster in R?
V=array(NA, dim=c(nrow(pixDF), n))
for(i in 1:n)
{
  sdC<-sqrt(det(Cov[,i,]))
  iC<-inv(Cov[,i,])
  V[,i]<-apply(pixDF,1,function(x)(sdC*exp(-0.5*((x-Mean[i,])%*%iC%*%as.matrix((x-Mean[i,]))))))
}
where, in this case, pixDF is a matrix of doubles with 490000 rows and 4 columns, and n = 5. Cov is a (4,5,4) array of doubles, and Mean is a (5,4) array of doubles.
Before editing, this loop took about 30 min on my computer.
It now takes about 1 min.
As Ronak notes, it is hard to help without a reproducible example. But I think that apply could be avoided. Something like this COULD work:
V <- array(NA, dim = c(nrow(pixDF), n))
tpixDF <- t(pixDF)
for (i in 1:n) {
  x <- Cov[, i, ]
  sdC <- sqrt(det(x))
  iC <- solve(x)
  mi <- Mean[i, ]
  k <- t(tpixDF - mi)
  V[, i] <- sdC*exp(-0.5*rowSums(k %*% iC * k))
}
Also, as Roland mentions, inv is probably equivalent to solve.
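Since no reproducible example was given, here is a minimal sketch on made-up data that checks the rowSums trick against the original apply call for a single mean vector and covariance matrix. The objects mu and S stand in for Mean[i, ] and Cov[, i, ] and are assumptions of this illustration. The last line uses stats::mahalanobis, which computes the same quadratic form.

```r
set.seed(1)
pixDF <- matrix(rnorm(40), nrow = 10, ncol = 4)  # small stand-in for the real data
mu <- rnorm(4)                                   # plays the role of Mean[i, ]
A <- matrix(rnorm(16), 4, 4)
S <- crossprod(A) + diag(4)                      # symmetric positive definite, like a covariance
iC <- solve(S)

# Row-by-row quadratic form, as in the question
q_apply <- apply(pixDF, 1, function(x) (x - mu) %*% iC %*% (x - mu))

# Vectorized: subtract mu from every row, then take row sums of (k %*% iC) * k
k <- t(t(pixDF) - mu)
q_vec <- rowSums((k %*% iC) * k)

# Built-in squared Mahalanobis distance gives the same numbers
q_mah <- mahalanobis(pixDF, center = mu, cov = S)

all.equal(q_apply, q_vec)
all.equal(q_vec, q_mah)
```

The vectorized form replaces one small matrix product per row with a single large one, which is typically where the speed-up comes from.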

R: Stuck on a "simple" problem: calculating total sum of squares in a n*m matrix

Given a data matrix with n rows and m columns, I would like to calculate the total sum of squares in R.
For this I've tried a loop that iterates through the rows of each column and saves the results in a vector. These are then added to the "TSS" vector where each value is the SS of one column. The sum of this vector should be the TSS.
set.seed(2020)
m <- matrix(c(sample(1:100, 80)), nrow = 40, ncol = 2)
tss <- c()
for(j in 1:ncol(m)){
  tssVec <- c()
  for(i in 1:nrow(m)){
    b <- sum(((m[i,]) - mean(m[,j]))^2)
    tssVec <- c(tssVec, b)
  }
  tss <- c(tss, sum(tssVec))
}
sum(tss)
The output is 136705.6, which cannot be right. As a novice coder, I am unfortunately stuck.
Any help is appreciated!
There are many ways to compute the TSS, and they all give the same result. Both snippets below compute the TSS of the first column of m. I would do something like:
Method 1, which uses ANOVA:
n <- as.data.frame(m)
mylm <- lm(n$V1 ~ n$V2)
SSTotal <-sum(anova(mylm)[,2])
Method 2:
SSTotal <- var( m[,1] ) * (nrow(m)-1)
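If the goal is the sum over all columns (an assumption about what the question means by "total"), the loop in the question goes wrong by using the whole row m[i,] where the single element m[i, j] is intended. A minimal sketch of the corrected loop next to two vectorized equivalents follows; the names tss_loop, tss_sweep and tss_var are made up for this illustration.

```r
set.seed(2020)
m <- matrix(sample(1:100, 80), nrow = 40, ncol = 2)

# Corrected loop: squared deviation of each element from its own column mean
tss_loop <- 0
for (j in seq_len(ncol(m))) {
  col_mean <- mean(m[, j])
  for (i in seq_len(nrow(m))) {
    tss_loop <- tss_loop + (m[i, j] - col_mean)^2
  }
}

# Vectorized: centre every column, square, and sum
tss_sweep <- sum(sweep(m, 2, colMeans(m))^2)

# Via the sample variance, as in Method 2, applied to every column
tss_var <- sum(apply(m, 2, var)) * (nrow(m) - 1)

all.equal(tss_loop, tss_sweep)
all.equal(tss_loop, tss_var)
```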

R: Generate matrix from function

In R, I'm interested in the general case of generating a matrix from a self-referencing formula such as:
X = some other matrix
Y(i, j) = X(i, j) + Y(i - 1, j - 1)
Unfortunately, I can't find how to account for the matrix referencing itself.
Obviously, order of execution and bounds checking are factors here, but I imagine these could be accounted for by the matrix orientation and the formula respectively.
Thanks.
This solution assumes that you want Y[1,n] == X[1,n] and Y[n,1] == X[n,1]. If not, you can apply the same solution on the sub-matrix X[-1,-1] to fill in the values of Y[-1,-1]. It also assumes that the input matrix is square.
We use the fact that Y[N,N] = X[N,N] + X[N-1, N-1] + ... + X[1,1] plus similar relations for off-diagonal elements. Note that off-diagonal elements are a diagonal of a particular sub-matrix.
# Example input
X <- matrix(1:16, ncol=4)
Y <- matrix(0, ncol=ncol(X), nrow=nrow(X))
diag(Y) <- cumsum(diag(X))
Y[1,ncol(X)] <- X[1,ncol(X)]
Y[nrow(X),1] <- X[nrow(X),1]
for (i in 1:(nrow(X)-2)) {
  ind <- seq(i)
  diag(Y[-ind,]) <- cumsum(diag(X[-ind,])) # lower triangle
  diag(Y[,-ind]) <- cumsum(diag(X[,-ind])) # upper triangle
}
Well, you can always use a for loop:
Y <- matrix(0, ncol=3, nrow=3)
#boundary values:
Y[1,] <- 1
Y[,1] <- 2
X <- matrix(1:9, ncol=3)
for (i in 2:nrow(Y)) {
  for (j in 2:ncol(Y)) {
    Y[i, j] <- X[i, j] + Y[i-1, j-1]
  }
}
If that is too slow you can translate it to C++ (using Rcpp) easily.
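For this particular recurrence, and assuming the first row and first column of Y are taken directly from X (the boundary condition used in the first answer), each diagonal of Y is just a cumulative sum along the matching diagonal of X. The sketch below groups cells by row(X) - col(X) and applies cumsum within each group using ave, then checks the result against the plain loop. The names Y_loop and Y_ave are made up for this illustration.

```r
X <- matrix(1:16, ncol = 4)

# Reference: explicit loop, with the first row and column copied from X
Y_loop <- X
for (i in 2:nrow(X)) {
  for (j in 2:ncol(X)) {
    Y_loop[i, j] <- X[i, j] + Y_loop[i - 1, j - 1]
  }
}

# Cells on the same diagonal share the value row - col;
# within a group, column-major order runs from top-left to bottom-right
Y_ave <- ave(X, row(X) - col(X), FUN = cumsum)

all.equal(Y_loop, Y_ave)
```

This form does not require X to be square, but it only covers recurrences that reduce to a running sum along diagonals; a different boundary condition or formula would still need the loop.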

Efficiently Load A Sparse Matrix in R

I'm having trouble efficiently loading data into a sparse matrix format in R.
Here is an (incomplete) example of my current strategy:
library(Matrix)
a1=Matrix(0,5000,100000,sparse=T)
for(i in 1:5000)
  a1[i,idxOfCols]=x
Where x is usually around length 20. This is not efficient and eventually slows to a crawl. I know there is a better way but wasn't sure how. Suggestions?
You can populate the matrix all at once:
library(Matrix)
n <- 5000
m <- 1e5
k <- 20
idxOfCols <- sample(1:m, k)
x <- rnorm(k)
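a2 <- sparseMatrix(
  i=rep(1:n, each=k),    # each row index repeated k times
  j=rep(idxOfCols, n),   # the same k columns for every row
  x=rep(x, n),           # one copy of x per row, matching the length of i and j
  dims=c(n,m)
)
# Compare
a1 <- Matrix(0,n,m,sparse=TRUE)
for(i in 1:n) {
  a1[i,idxOfCols] <- x
}
sum(a1 - a2) # 0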
You don't need to use a for-loop. You can just use standard matrix indexing with a two-column matrix:
a1[ cbind(i,idxOfCols) ] <- x
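In the more general case where every row has its own column indices and values, one option is to collect them first and make a single sparseMatrix call. The sketch below uses made-up data; the lists cols and vals are hypothetical stand-ins for however the per-row data is actually produced.

```r
library(Matrix)

set.seed(1)
n <- 50    # rows
m <- 1000  # columns
k <- 20    # non-zero entries per row

# Hypothetical per-row inputs: k distinct column indices and k values for each row
cols <- lapply(seq_len(n), function(r) sample(m, k))
vals <- lapply(seq_len(n), function(r) rnorm(k))

# Build the triplets once, then construct the matrix in a single call
a <- sparseMatrix(
  i = rep(seq_len(n), lengths(cols)),  # row index repeated once per entry in that row
  j = unlist(cols),
  x = unlist(vals),
  dims = c(n, m)
)

# Spot check: row 3 holds exactly the values that were supplied for it
all.equal(as.numeric(a[3, cols[[3]]]), vals[[3]])
```

Assigning into a sparse matrix row by row forces its internal representation to be rebuilt on each assignment, which is why the loop slows down as the matrix fills.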

create a vector from outputs

I have the following code in R:
z <- scale(x) / sqrt(n-1) # standardized matrix x such that z'z=correlation matrix
R <- t(z) %*% z # correlation matrix
I <- diag(py - 1) # identity matrix(py defined before)
df <- rep(0, length(k)) # k=seq(0,5,0.001)
for (i in seq(0,5,0.001)) {
  H <- z %*% solve(R+(i*I)) %*% t(z)
  tr <- sum(diag(H))
  df <- c(df,tr) ## problem here
}
The last line of the loop is the problem: I want df to be a vector that collects the value of tr for each i, so that df ends up containing all the tr values.
Any help is appreciated.
Thanks
Separate the points that you want to solve at from the loop index.
solve_points <- seq(0,5,0.001)
for(i in seq_along(solve_points))
{
  H=z%*%solve(R+(solve_points[i]*I))%*%t(z)  # use the i-th point, not the whole vector
  tr=sum(diag(H))
  df[i] <- tr
}
You want to fill in the vector df, not concatenate it all the time. That will slow R down a lot as it has to copy the object each iteration of the loop.
I think you probably want something like this:
for (i in seq_along(k)) { ## loop over 1:length(k)
  H <- z %*% solve(R+(k[i]*I)) %*% t(z) ## use i to index into k
  tr <- sum(diag(H))
  df[i] <- tr ## add `tr` to the ith element of df
}
but a reproducible example would have helped. For example, you might not need to index k; it depends on what your code is really doing, and you don't provide all the objects needed to check.
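A self-contained sketch of the preallocate-and-fill pattern on made-up data is shown below. The matrix x, its dimensions, and the coarser grid for k are assumptions of this illustration, and I is sized with ncol(x) rather than the question's py - 1. The vapply variant relies on the identity tr(z A z') = tr(A z'z), so the trace can be taken from the small matrix solve(R + k*I, R) without forming H at all.

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 40, ncol = 5)  # made-up data
n <- nrow(x)
z <- scale(x) / sqrt(n - 1)
R <- crossprod(z)            # same as t(z) %*% z
I <- diag(ncol(x))
k <- seq(0, 5, by = 0.5)     # coarser grid than the question, for speed

# Preallocate, then fill position i on each pass
df_loop <- numeric(length(k))
for (i in seq_along(k)) {
  H <- z %*% solve(R + k[i] * I) %*% t(z)
  df_loop[i] <- sum(diag(H))
}

# Same values without building the n-by-n matrix H
df_vapply <- vapply(k, function(ki) sum(diag(solve(R + ki * I, R))), numeric(1))

all.equal(df_loop, df_vapply)
```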
