Mean imputation in a matrix in R

I have a matrix in R with 440 rows and 261 columns.
There are some 0 values.
In each row I need to replace the 0 values with the mean of all the values in that row.
I tried to do it with the code below, but every time it used only the first mean value.
snp2<- read.table("snp2.txt",h=T)
mean <- rowMeans(snp2)
for(k in 1:nrow(snp2))
{
snp2[k==0]<-mean[k]
}

Instead of looping through the rows, you could do this in one shot by identifying all the 0 indices in the matrix and replacing them with the appropriate row mean:
# Sample data
(mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow=3))
# [,1] [,2] [,3]
# [1,] 0 1 11
# [2,] 1 0 11
# [3,] 2 3 11
(zeroes <- which(mat == 0, arr.ind=TRUE))
# row col
# [1,] 1 1
# [2,] 2 2
mat[zeroes] <- rowMeans(mat)[zeroes[,"row"]]
mat
# [,1] [,2] [,3]
# [1,] 4 1 11
# [2,] 1 4 11
# [3,] 2 3 11
While you could fix up your function to replace these missing values row-by-row, this will not be as efficient as the one-shot approach (in addition to being more typing):
josilber <- function(mat) {
zeroes <- which(mat == 0, arr.ind=TRUE)
mat[zeroes] <- rowMeans(mat)[zeroes[,"row"]]
mat
}
OP.fixed <- function(mat) {
means <- rowMeans(mat)
for(k in 1:nrow(mat)) {
mat[k,][mat[k,] == 0] <- means[k]
}
mat
}
bgoldst <- function(m) ifelse(m==0,rowMeans({ mt <- m; mt[mt==0] <- NA; mt; },na.rm=T)[row(m)],m);
# 4400 x 2610 matrix
bigger <- matrix(sample(0:10, 4400*2610, replace=TRUE), nrow=4400)
all.equal(josilber(bigger), OP.fixed(bigger))
# [1] TRUE
# bgoldst differs because it takes means of non-zero values only
library(microbenchmark)
microbenchmark(josilber(bigger), OP.fixed(bigger), bgoldst(bigger), times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# josilber(bigger) 262.541 382.0706 406.1107 395.3815 452.0872 532.4742 10
# OP.fixed(bigger) 1033.071 1184.7288 1236.6245 1238.8298 1271.7677 1606.6737 10
# bgoldst(bigger) 3820.044 4033.5826 4368.5848 4201.6302 4611.9697 5581.5514 10
For a fairly large matrix (4400 x 2610), the one-shot procedure is about 3 times quicker than the fixed-up solution from the question and about 10 times faster than the one proposed by @bgoldst.

Here's a solution using ifelse(), assuming you want to exclude zeroes from the mean calculation:
NR <- 5; NC <- 5;
set.seed(1); m <- matrix(sample(c(rep(0,5),1:5),NR*NC,replace=T),NR);
m;
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 4 0 0 5
## [2,] 0 5 0 3 0
## [3,] 1 2 2 5 2
## [4,] 5 2 0 0 0
## [5,] 0 0 3 3 0
ifelse(m==0,rowMeans({ mt <- m; mt[mt==0] <- NA; mt; },na.rm=T)[row(m)],m);
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4.5 4 4.5 4.5 5.0
## [2,] 4.0 5 4.0 3.0 4.0
## [3,] 1.0 2 2.0 5.0 2.0
## [4,] 5.0 2 3.5 3.5 3.5
## [5,] 3.0 3 3.0 3.0 3.0
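The two answers above differ in which values go into the row mean: the which()-based version averages every entry in the row, zeros included, while the ifelse() version averages only the non-zero entries. Here is a minimal sketch that puts both behind one switch. The function name impute_zeros and the exclude_zeros argument are made up for illustration and do not come from either answer; the sketch also assumes a numeric matrix, so data read with read.table() would first be converted with as.matrix().

```r
# Hypothetical helper: replace zeros with the row mean.
# exclude_zeros = FALSE: mean over all entries (zeros count as values).
# exclude_zeros = TRUE:  mean over the non-zero entries only.
impute_zeros <- function(mat, exclude_zeros = FALSE) {
  zeroes <- which(mat == 0, arr.ind = TRUE)   # two-column (row, col) index
  src <- mat
  if (exclude_zeros) src[src == 0] <- NA      # hide zeros from the mean
  means <- rowMeans(src, na.rm = TRUE)
  mat[zeroes] <- means[zeroes[, "row"]]       # one mean per zero, picked by row
  mat
}

mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow = 3)
impute_zeros(mat)[1, 1]                        # (0 + 1 + 11) / 3 = 4
impute_zeros(mat, exclude_zeros = TRUE)[1, 1]  # (1 + 11) / 2 = 6
```

Note that with exclude_zeros = TRUE a row consisting only of zeros has no non-zero entries to average, so its imputed values come out as NaN.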

Related

Transform adjacency lists to binary matrix in R

Given a list of the locations of 1s in each row, I'm trying to find an efficient way to construct a binary matrix. Here's a small example, although I’m trying to find something that scales well -
Given a binary matrix:
> M <- matrix(rbinom(25,1,0.5),5,5)
> M
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 1 1 0
[2,] 0 1 1 1 1
[3,] 1 1 0 1 1
[4,] 1 0 0 1 0
[5,] 0 1 1 0 0
I can transform M into an adjacency list using:
> Mlist <- apply(M==1, 1, which, simplify = FALSE)
> Mlist
[[1]]
[1] 2 3 4
[[2]]
[1] 2 3 4 5
[[3]]
[1] 1 2 4 5
[[4]]
[1] 1 4
[[5]]
[1] 2 3
I'd like to transform Mlist back into M. One possibility is:
M.new <- matrix(0,5,5)
for (row in 1:5){M.new[row,Mlist[[row]]] <- 1}
But, it seems like there should be a more efficient way.
Thanks!
1) Using M and Mlist defined in the Note at the end, sapply over its components replacing a vector of zeros with ones at the needed locations. Transpose at the end.
M2 <- t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M2) # check that M2 equals M
## [1] TRUE
2) A variation with slightly more keystrokes, but faster, would be
M3 <- do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M3)
## [1] TRUE
Benchmark
Here ex1 and ex2 are (1) and (2) above and ex0 is the for loop in the question, except that we used integer instead of double. Note that (2) is about 100x faster than the loop in the question.
library(microbenchmark)
microbenchmark(
ex0 = { M.new <- matrix(0L,5,5); for (row in 1:5){M.new[row,Mlist[[row]]] <- 1L} },
ex1 = t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L)),
ex2 = do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
)
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
ex0 4454.4 4504.15 4639.111 4564.1 4670.10 8450.2 100 b
ex1 73.1 84.75 98.220 94.3 111.75 130.8 100 a
ex2 32.0 36.20 43.866 42.7 51.85 82.5 100 a
Note
set.seed(123)
M <- matrix(rbinom(25,1,0.5),5,5)
Mlist <- apply(M==1, 1, which, simplify = FALSE)
Use vectorized row/column indexing: repeat each row index by the length of the corresponding element of 'Mlist', then cbind that with the unlisted 'Mlist' to get a two-column index matrix. That index matrix can be used to set the matching elements of 'M.new' (the zero matrix created in the question) to 1
ind <- cbind(rep(seq_along(Mlist), lengths(Mlist)), unlist(Mlist))
M.new[ind] <- 1
-checking
> all.equal(M, M.new)
[1] TRUE
Or another option is sparseMatrix
library(Matrix)
as.matrix(sparseMatrix(i = rep(seq_along(Mlist), lengths(Mlist)),
j = unlist(Mlist), x = 1))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 1 1 1
[2,] 0 1 0 1 0
[3,] 1 0 0 1 0
[4,] 0 1 0 1 0
[5,] 1 0 1 1 1
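One thing to watch with both index-based options: the number of columns is not stored in the adjacency list itself. sparseMatrix() defaults its dimensions to the largest row and column index it sees unless dims is supplied, so a trailing all-zero column would be dropped. A small sketch that wraps the index-matrix approach and takes the column count explicitly; the name list_to_binary is hypothetical, and the list below is copied from the printed output in the question:

```r
# Hypothetical wrapper around the index-matrix idea; ncol is passed
# explicitly so trailing all-zero columns are not lost.
list_to_binary <- function(lst, ncol) {
  out <- matrix(0L, nrow = length(lst), ncol = ncol)
  ind <- cbind(rep(seq_along(lst), lengths(lst)),  # row index for each entry
               unlist(lst))                        # column index for each entry
  out[ind] <- 1L                                   # two-column matrix indexing
  out
}

# Adjacency list as printed in the question
Mlist <- list(c(2, 3, 4), c(2, 3, 4, 5), c(1, 2, 4, 5), c(1, 4), c(2, 3))
M <- list_to_binary(Mlist, ncol = 5)
M[4, ]   # 1 0 0 1 0
```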

How to vectorize a function

I have a 5x4 matrix. I have created a function called fun1, which uses a double for loop to go through the matrix and a distance function to work out the distance between each pair of rows. The final result will be a 5x5 matrix.
I am struggling to convert fun1 into a vectorized function (no loops, only apply functions).
x =
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
distance = function(a, b) {
sqrt(sum((a - b)^2))
}
fun1 = function(x) {
n = nrow(x)
results = matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
for (j in seq_len(n)) {
results[i,j] = distance(x[i,], x[j,])
}
}
results
}
You can do it with just a matrix multiplication, some additions and a transpose.
x <- matrix(1:20, nrow = 5)
z <- x %*% t(x)
sqrt(diag(z)+t(diag(z)-2*z))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 2 4 6 8
#> [2,] 2 0 2 4 6
#> [3,] 4 2 0 2 4
#> [4,] 6 4 2 0 2
#> [5,] 8 6 4 2 0
Interestingly, this is faster than the built-in method mentioned in the comments above!
mdist <- function(x) {
z <- x %*% t(x)
sqrt(diag(z)+t(diag(z)-2*z))
}
n <- 1000
l <- 100
x <- matrix(runif(n*l), ncol = l)
microbenchmark::microbenchmark(
z1 = as.matrix(dist(x)),
z2 = dist(x, diag = TRUE, upper = TRUE),
z3 = mdist(x),
times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> z1 82.98502 90.20049 98.54552 94.85027 101.78114 140.1809 100
#> z2 72.54279 76.22054 82.75410 79.31865 83.47765 231.3008 100
#> z3 54.58258 59.73461 65.62313 63.14435 67.49865 115.0379 100
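One caveat with the matrix-product formula: each squared distance is computed as a difference of larger numbers, so floating-point rounding can leave tiny negative values (typically on the diagonal), and sqrt() then returns NaN with a warning. Below is a hedged sketch that clamps at zero and checks the result against dist(). The name mdist_safe is hypothetical and not part of the answer, and the clamp is an added assumption about what one would want in that case.

```r
mdist_safe <- function(x) {
  z <- tcrossprod(x)                  # same as x %*% t(x)
  sq <- diag(z)                       # squared norm of each row
  d2 <- outer(sq, sq, "+") - 2 * z    # |a|^2 + |b|^2 - 2 * a.b
  sqrt(pmax(d2, 0))                   # clamp rounding noise below zero
}

x <- matrix(1:20, nrow = 5)
d <- mdist_safe(x)
# Compare with the built-in; dist() adds dimnames, so attributes are ignored
all.equal(d, as.matrix(dist(x)), check.attributes = FALSE)
```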
In a pinch, Vectorize can do what you need:
outer(seq_len(nrow(m)), seq_len(nrow(m)),
Vectorize(function(i,j) distance(m[i,], m[j,]), vectorize.args=c("i","j")))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 2 4 6 8
# [2,] 2 0 2 4 6
# [3,] 4 2 0 2 4
# [4,] 6 4 2 0 2
# [5,] 8 6 4 2 0
Vectorize takes a function as an argument and returns a function that accepts vectors, iterating internally. The function passed to it is called once for each element within the vector passed. By default, Vectorize only vectorizes the first argument of the function, but it can "zip" along multiple arguments, assuming they are all the same length, by using vectorize.args=.
This might be a little easier to visualize by redefining distance:
distance_ind = function(i, j, data) {
sqrt(sum((data[i,] - data[j,])^2))
}
distance_ind(1, 2, m)
# [1] 2
distance_ind(c(1,3), c(2,3), m)
# [1] 2 ### wrong
distance_ind_vec <- Vectorize(distance_ind, vectorize.args = c("i", "j"))
distance_ind_vec(c(1,3), c(2,3), m)
# [1] 2 0
And the outer call:
outer(seq_len(nrow(m)), seq_len(nrow(m)), distance_ind_vec, data = m)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 2 4 6 8
# [2,] 2 0 2 4 6
# [3,] 4 2 0 2 4
# [4,] 6 4 2 0 2
# [5,] 8 6 4 2 0

create n*n matrix from n-1*n matrix by adding diagonal elements as 1 in R

For example I have a 2*3 matrix
[,1] [,2] [,3]
[1,] 2 4 6
[2,] 3 5 7
I want to have a 3*3 matrix with 1 inserted on the diagonal, in R.
The output :
[,1] [,2] [,3]
[1,] 1 4 6
[2,] 2 1 7
[3,] 3 5 1
One option could be:
mat_new <- `diag<-`(matrix(ncol = ncol(mat), nrow = nrow(mat) + 1, 0), 1)
mat_new[mat_new == 0] <- mat
[,1] [,2] [,3]
[1,] 1 4 6
[2,] 2 1 7
[3,] 3 5 1
Or a variation on the original idea (proposed by @Henrik):
mat_new <- diag(ncol(mat))
mat_new[mat_new == 0] <- mat
Sample data:
mat <- structure(2:7, .Dim = 2:3, .Dimnames = list(c("[1,]", "[2,]"),
NULL))
Using append.
unname(mapply(function(x, y) append(x, 1, y), as.data.frame(m), 1:ncol(m) - 1))
# [,1] [,2] [,3]
# [1,] 1 4 6
# [2,] 2 1 7
# [3,] 3 5 1
Or using replace.
replace(diag(3), diag(3) < 1, m)
# [,1] [,2] [,3]
# [1,] 1 4 6
# [2,] 2 1 7
# [3,] 3 5 1
Data:
m <- structure(2:7, .Dim = 2:3)
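The diag()-based fills above work because R stores and assigns matrices column by column: in an n x n identity matrix each column has exactly n - 1 zeros, which matches the n - 1 values in each column of the input, so the values land in the right columns with the 1 slotted in between. A minimal sketch that wraps this in a function with a dimension check; the name add_unit_diag is made up for illustration:

```r
# Hypothetical helper generalising the diag() fill to any (n-1) x n input.
add_unit_diag <- function(m) {
  stopifnot(nrow(m) + 1 == ncol(m))
  out <- diag(ncol(m))   # n x n identity: 1 on the diagonal, 0 elsewhere
  out[out == 0] <- m     # zeros are filled column by column, n - 1 per column
  out
}

m <- matrix(2:7, nrow = 2)
add_unit_diag(m)
```

The check on the dimensions is there because a mismatched input would otherwise be silently recycled or truncated by the assignment.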
For your matrix, you could work with the upper and lower triangles. Here is some code that may be useful:
#Input matrix
A <- matrix(c(2,4,6,3,5,7),nrow = 2,ncol = 3,byrow = T)
[,1] [,2] [,3]
[1,] 2 4 6
[2,] 3 5 7
#Output matrix
B <- matrix(0,nrow = 3,ncol = 3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
Now we replace:
#Replace
B[upper.tri(B)] <- A[upper.tri(A)]
B[lower.tri(B)] <- A[lower.tri(A,diag = T)]
diag(B) <- 1
#Final output
B
The result:
[,1] [,2] [,3]
[1,] 1 4 6
[2,] 2 1 7
[3,] 3 5 1
I benchmarked the functions given in the previous answers:
add_diagonal <- function(mat) {
res <- diag(ncol(mat))
res[res == 0] <- mat
res
}
add_diagonal_1 <- function(mat) {
n <- max(dim(mat))
res <- matrix(0, nrow=n, ncol=n)
res[upper.tri(res)] <- mat[upper.tri(mat)]
res[lower.tri(res)] <- mat[lower.tri(mat, diag = TRUE)]
diag(res) <- 1
res
}
add_diagonal_2 <- function(mat) {
n <- max(dim(mat))
replace(diag(n), diag(n) < 1, mat)
}
add_diagonal_3 <- function(mat) {
unname(mapply(function(x, y) append(x, 1, y), as.data.frame(mat), 1:ncol(mat) - 1))
}
require(microbenchmark)
A <- matrix(c(2,4,6,3,5,7),nrow = 2,ncol = 3,byrow = T)
microbenchmark(add_diagonal(A), add_diagonal_1(A), add_diagonal_2(A), add_diagonal_3(A), times=10000)
The result:
Unit: microseconds
expr min lq mean median uq max neval cld
add_diagonal(A) 8.569 10.3865 13.17156 11.8440 14.4760 5256.301 10000 a
add_diagonal_1(A) 40.601 44.2130 51.68039 48.7940 51.7795 11519.797 10000 c
add_diagonal_2(A) 14.279 16.8790 20.60770 18.8860 21.7520 5966.649 10000 b
add_diagonal_3(A) 166.582 173.1480 189.50570 175.8495 179.2100 8586.079 10000 d
As we see, the first function is the fastest, followed by the replace method.
As is often the case, the apply-family approach (here mapply) performs quite poorly.
Here is another base R option using diag + expand.grid + replace
replace(
diag(ncol(mat)),
as.matrix(subset(do.call(expand.grid, replicate(2, 1:ncol(mat), simplify = FALSE)), Var1 != Var2)),
mat
)
which gives
[,1] [,2] [,3]
[1,] 1 4 6
[2,] 2 1 7
[3,] 3 5 1

Creating a 5x5 matrix with 0's lined diagonally

In R, I want to create a 5x5 matrix of 0,1,3,5,7 such that:
0 1 3 5 7
1 0 3 5 7
1 3 0 5 7
1 3 5 0 7
1 3 5 7 0
So obviously I can generate the starting matrix:
z <- c(0,1,3,5,7)
matrix(z, ncol=5, nrow=5, byrow = TRUE)
but I'm unsure of how to move the 0's position. I'm sure I have to use some sort of for/in loop, but I really don't know what exactly I need to do.
How about this:
m <- 1 - diag(5)
m[m==1] <- rep(c(1,3,5,7), each=5)
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 3 5 7
# [2,] 1 0 3 5 7
# [3,] 1 3 0 5 7
# [4,] 1 3 5 0 7
# [5,] 1 3 5 7 0
Or we can do:
z <- c(1,3,5,7)
mat <- 1-diag(5)
mat[mat==1] <- z
t(mat)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 3 5 7
# [2,] 1 0 3 5 7
# [3,] 1 3 0 5 7
# [4,] 1 3 5 0 7
# [5,] 1 3 5 7 0
Yet another solution just to enjoy combn as well:
r <- integer(5)
t(combn(5, 1, function(v) {r[v]<-0;r[-v]<-z;r}))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 3 5 7
# [2,] 1 0 3 5 7
# [3,] 1 3 0 5 7
# [4,] 1 3 5 0 7
# [5,] 1 3 5 7 0
Or using sapply:
v <- integer(5)
t(sapply(seq(5), function(x) {v[x]<-0;v[-x]<-z;v}))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 3 5 7
# [2,] 1 0 3 5 7
# [3,] 1 3 0 5 7
# [4,] 1 3 5 0 7
# [5,] 1 3 5 7 0
Here's a solution that builds the data vector with a couple of calls to rep(), a couple of calls to c(), a seq(), and an rbind(), and then wraps it in a call to matrix():
N <- 5L;
matrix(rep(c(0,rbind(seq(1,(N-1)*2,2),0)),rep(c(1,N),len=N*2-1)),N);
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 3 5 7
## [2,] 1 0 3 5 7
## [3,] 1 3 0 5 7
## [4,] 1 3 5 0 7
## [5,] 1 3 5 7 0
Another idea, using two calls to diag() and a cumsum():
N <- 5L;
(1-diag(N))*(cumsum(diag(N)*2)-1);
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 3 5 7
## [2,] 1 0 3 5 7
## [3,] 1 3 0 5 7
## [4,] 1 3 5 0 7
## [5,] 1 3 5 7 0
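The cumsum() one-liner is compact, so here is the same computation split into named steps as a sketch of how it works. The intermediate names mask, running and vals are only for illustration. cumsum() walks the diagonal matrix column by column, so the running total steps up by 2 each time it passes a diagonal entry; subtracting 1 turns the totals 2, 4, 6, 8 into 1, 3, 5, 7, and the mask zeroes the diagonal itself.

```r
N <- 5L
mask <- 1 - diag(N)                  # 0 on the diagonal, 1 everywhere else
running <- cumsum(diag(N) * 2) - 1   # plain vector of length N*N, column-major
vals <- matrix(running, N)           # reshaped only to make it easy to inspect
vals[, 2]                            # 1 3 3 3 3: steps up after the diagonal entry
result <- mask * running             # elementwise product keeps the matrix shape
result[1, ]                          # 0 1 3 5 7
```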
Benchmarking
Note: For the following benchmarking tests I modified everyone's solutions where necessary to ensure they are parameterized on the matrix size N. For the most part, this just involved replacing some literals with N, and replacing instances of c(1,3,5,7) with seq(1,(N-1)*2,2). I think this is fair.
library(microbenchmark);
josh <- function(N) { m <- 1-diag(N); m[m==1] <- rep(seq(1,(N-1)*2,2),each=N); m; };
marat <- function(N) matrix(rbind(0,col(diag(N))*2-1),nrow=N,ncol=N);
gregor <- function(N) { x = seq(1,(N-1)*2,2); t(mapply(FUN = append, after = c(0, seq_along(x)), MoreArgs = list(x = x, values = 0))); };
barkley <- function(N) { my_vec <- seq(1,(N-1)*2,2); my_val <- 0; my_mat <- matrix(NA, ncol = length(my_vec)+1, nrow = length(my_vec)+1); for (i in 1:nrow(my_mat)) { my_mat[i, i] <- my_val; my_mat[i, -i] <- my_vec; }; my_mat; };
m0h3n <- function(N) { z <- seq(1,(N-1)*2,2); mat=1-diag(N); mat[mat==1]=z; t(mat); };
bgoldst1 <- function(N) matrix(rep(c(0,rbind(seq(1,(N-1)*2,2),0)),rep(c(1,N),len=N*2-1)),N);
bgoldst2 <- function(N) (1-diag(N))*(cumsum(diag(N)*2)-1);
## small-scale: 5x5
N <- 5L;
ex <- josh(N);
identical(ex,marat(N));
## [1] TRUE
identical(ex,gregor(N));
## [1] TRUE
identical(ex,barkley(N));
## [1] TRUE
identical(ex,m0h3n(N));
## [1] TRUE
identical(ex,bgoldst1(N));
## [1] TRUE
identical(ex,bgoldst2(N));
## [1] TRUE
microbenchmark(josh(N),marat(N),gregor(N),barkley(N),m0h3n(N),bgoldst1(N),bgoldst2(N));
## Unit: microseconds
## expr min lq mean median uq max neval
## josh(N) 20.101 21.8110 25.71966 23.0935 24.8045 108.197 100
## marat(N) 5.987 8.1260 9.01131 8.5535 8.9820 24.805 100
## gregor(N) 49.608 51.9605 57.61397 53.8850 61.7965 98.361 100
## barkley(N) 29.081 32.0750 36.33830 33.7855 41.9110 54.740 100
## m0h3n(N) 22.666 24.8040 28.45663 26.0870 28.4400 59.445 100
## bgoldst1(N) 20.528 23.0940 25.49303 23.5220 24.8050 56.879 100
## bgoldst2(N) 3.849 5.1320 5.73551 5.5600 5.9880 16.251 100
## medium-scale: 50x50
N <- 50L;
ex <- josh(N);
identical(ex,marat(N));
## [1] TRUE
identical(ex,gregor(N));
## [1] TRUE
identical(ex,barkley(N));
## [1] TRUE
identical(ex,m0h3n(N));
## [1] TRUE
identical(ex,bgoldst1(N));
## [1] TRUE
identical(ex,bgoldst2(N));
## [1] TRUE
microbenchmark(josh(N),marat(N),gregor(N),barkley(N),m0h3n(N),bgoldst1(N),bgoldst2(N));
## Unit: microseconds
## expr min lq mean median uq max neval
## josh(N) 106.913 110.7630 115.68488 113.1145 116.1080 179.187 100
## marat(N) 62.866 65.4310 78.96237 66.7140 67.9980 1163.215 100
## gregor(N) 195.438 205.2735 233.66129 213.6130 227.9395 1307.334 100
## barkley(N) 184.746 194.5825 227.43905 198.6455 207.1980 1502.771 100
## m0h3n(N) 73.557 76.1230 92.48893 78.6885 81.6820 1176.045 100
## bgoldst1(N) 51.318 54.3125 95.76484 56.4500 60.0855 1732.421 100
## bgoldst2(N) 18.817 21.8110 45.01952 22.6670 23.5220 1118.739 100
## large-scale: 1000x1000
N <- 1e3L;
ex <- josh(N);
identical(ex,marat(N));
## [1] TRUE
identical(ex,gregor(N));
## [1] TRUE
identical(ex,barkley(N));
## [1] TRUE
identical(ex,m0h3n(N));
## [1] TRUE
identical(ex,bgoldst1(N));
## [1] TRUE
identical(ex,bgoldst2(N));
## [1] TRUE
microbenchmark(josh(N),marat(N),gregor(N),barkley(N),m0h3n(N),bgoldst1(N),bgoldst2(N));
## Unit: milliseconds
## expr min lq mean median uq max neval
## josh(N) 40.32035 43.42810 54.46468 45.36386 80.17241 90.69608 100
## marat(N) 41.00074 45.34248 54.74335 47.00904 50.74608 93.85429 100
## gregor(N) 33.65923 37.82393 50.50060 40.24914 75.09810 83.27246 100
## barkley(N) 31.02233 35.42223 43.08745 36.85615 39.81999 85.28585 100
## m0h3n(N) 27.08622 31.00202 38.98395 32.33244 34.33856 90.82652 100
## bgoldst1(N) 12.53962 13.02672 18.31603 14.92314 16.96433 59.87945 100
## bgoldst2(N) 13.23926 16.87965 28.81906 18.92319 54.60009 62.01258 100
## very large scale: 10,000x10,000
N <- 1e4L;
ex <- josh(N);
identical(ex,marat(N));
## [1] TRUE
identical(ex,gregor(N));
## [1] TRUE
identical(ex,barkley(N));
## [1] TRUE
identical(ex,m0h3n(N));
## [1] TRUE
identical(ex,bgoldst1(N));
## [1] TRUE
identical(ex,bgoldst2(N));
## [1] TRUE
microbenchmark(josh(N),marat(N),gregor(N),barkley(N),m0h3n(N),bgoldst1(N),bgoldst2(N));
## Unit: seconds
## expr min lq mean median uq max neval
## josh(N) 3.698714 3.908910 4.067409 4.046770 4.191938 4.608312 100
## marat(N) 6.440882 6.977273 7.272962 7.223293 7.493600 8.471888 100
## gregor(N) 3.546885 3.850812 4.032477 4.022563 4.221085 4.651799 100
## barkley(N) 2.955906 3.162409 3.324033 3.279032 3.446875 4.444848 100
## m0h3n(N) 3.355968 3.667484 3.829618 3.777151 3.973279 4.649226 100
## bgoldst1(N) 1.044510 1.260041 1.363827 1.369945 1.441194 1.819248 100
## bgoldst2(N) 1.144168 1.391711 1.517189 1.519653 1.629994 2.478636 100
Perhaps not the most beautiful solution ever, but maybe elegant in its simplicity:
my_vec <- c(1,3,5,7)
my_val <- 0
my_mat <- matrix(NA, ncol = length(my_vec)+1, nrow = length(my_vec)+1)
for (i in 1:nrow(my_mat)) {
my_mat[i, i] <- my_val
my_mat[i, -i] <- my_vec
}
my_mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 3 5 7
[2,] 1 0 3 5 7
[3,] 1 3 0 5 7
[4,] 1 3 5 0 7
[5,] 1 3 5 7 0
You could use
n <- 5
matrix(rbind(0,col(diag(n))*2-1),nrow=n,ncol=n)
Fun question! In poking around, I saw that append has an after argument.
x = c(1, 3, 5, 7)
t(mapply(FUN = append, after = c(0, seq_along(x)),
MoreArgs = list(x = x, values = 0)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 1 3 5 7
# [2,] 1 0 3 5 7
# [3,] 1 3 0 5 7
# [4,] 1 3 5 0 7
# [5,] 1 3 5 7 0
Another option, directly constructing each row:
v = c(1, 3, 5, 7)
n = length(v)
t(sapply(0:n, function(i) c(v[0:i], 0, v[seq(to = n, length.out = n - i)])))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 1 3 5 7
#[2,] 1 0 3 5 7
#[3,] 1 3 0 5 7
#[4,] 1 3 5 0 7
#[5,] 1 3 5 7 0

adding successive four / n numbers in large matrix in R

I have a very large dataset with dimensions of 60K x 4K. I am trying to add every four successive values in each row, column-wise. The following is a smaller example dataset.
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0 1 1 1 0 1 1 0 1 1 0 0
[2,] 1 0 0 1 0 1 1 0 1 0 0 0
[3,] 0 1 1 0 0 1 1 1 0 0 0 0
[4,] 1 1 0 1 1 1 1 1 0 0 0 0
Here is what I am trying to perform:
mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3
i.e. add every four values and output.
mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2
Keep going to end of matrix (here to 12).
mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]
Once first row is done apply the same to second row, like:
mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]
The result will be nrow x (ncol)/4 matrix.
The expected result will look like:
col1-col4 col5-8 col9-12
row1 3 2 2
row2 2 2 1
row3 2 3 0
row4 3 4 0
Similarly for row 3 up to the number of rows in the matrix. How can I loop over this efficiently?
While Matthew's answer is really cool (+1, btw), you can get a much (~100x) faster solution if you avoid apply and use the *Sums functions (in this case colSums), and a bit of vector manipulation trickery:
funSums <- function(mat) {
t.mat <- t(mat) # rows become columns
dim(t.mat) <- c(4, length(t.mat) / 4) # wrap columns every four items (this is what we want to sum)
t(matrix(colSums(t.mat), nrow=ncol(mat) / 4)) # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)
Produces desired output:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Now, let's make something the real size and compare against the other options:
set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)
funApply <- function(mat) { # Matthew's Solution
apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
funRcpp <- function(mat) { # David's Solution
roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1)%%4 == 1]
}
library(microbenchmark)
library(RcppRoll) # provides roll_sum(), used by funRcpp
microbenchmark(times=10,
funSums(mat),
funApply(mat),
funRcpp(mat)
)
Produces:
Unit: milliseconds
expr min lq median uq max neval
funSums(mat) 4.035823 4.079707 5.256517 7.5359 42.06529 10
funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960 10
funRcpp(mat) 18.481184 20.364885 38.595383 106.0277 132.93382 10
And to check:
all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE
The key point is that the *Sums functions are fully "vectorized", inasmuch as all the calculations happen in C. apply still needs to do a bunch of not strictly vectorized (in the primitive C function way) work in R, and is slower (but far more flexible).
Specific to this problem, it might be possible to make it 2-3x faster as about half the time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.
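As a sketch of that idea, the transpositions can be avoided altogether by adding column slices together: columns 1, 5, 9, ... plus columns 2, 6, 10, ... and so on. The function name sum_blocks and its k argument are illustrative and not from any of the answers, and this version has not been benchmarked against funSums, so whether it is actually faster is an open assumption.

```r
# Hypothetical alternative: sum every k consecutive columns without t().
sum_blocks <- function(mat, k = 4L) {
  stopifnot(ncol(mat) %% k == 0)
  starts <- seq(1L, ncol(mat), by = k)       # first column of each block
  out <- mat[, starts, drop = FALSE]
  for (offset in seq_len(k - 1L)) {          # add the remaining k - 1 columns
    out <- out + mat[, starts + offset, drop = FALSE]
  }
  out
}

small <- matrix(1:24, nrow = 2)   # row 1 is 1, 3, ..., 23; row 2 is 2, 4, ..., 24
sum_blocks(small)
# row 1: 1+3+5+7 = 16, 9+11+13+15 = 48, 17+19+21+23 = 80
```

The loop runs only k - 1 times regardless of the matrix size, so all the heavy work is still done by vectorized matrix addition.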
Dividing the matrix up into a 3D array is one way:
apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
# [,1] [,2] [,3]
# [1,] 3 2 2
# [2,] 2 2 1
# [3,] 2 3 0
# [4,] 3 4 0
Here's another approach using the RcppRoll package
library(RcppRoll) # Uses C++/Rcpp
n <- 4 # The summing range
roll_sum(mat, n, by.column = F)[, seq_len(ncol(mat) - n + 1) %% n == 1]
## [,1] [,2] [,3]
## [1,] 3 2 2
## [2,] 2 2 1
## [3,] 2 3 0
## [4,] 3 4 0
This might be the slowest of all:
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
mat
output <- sapply(seq(4,ncol(mat),4), function(i) { apply(mat,1,function(j){
sum(j[c(i-3, i-2, i-1, i)], na.rm=TRUE)
})})
output
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Maybe nested for-loops would be slower, but this answer is pretty close to being nested for-loops.

Resources