I have very large binary matrices with duplicated rows:
# create matrix
M = matrix(0, 10000, 50)
# randomly set up to two elements per row to one
for(ii in 1:nrow(M)){
M[ii, sample(1:ncol(M), 1)] = 1
M[ii, sample(1:ncol(M), 1)] = 1
}
I'm trying to find the index of the most frequently occuring row. The general problem of finding the most frequently occuring row is discussed for instance here and here. Another viable solution is to use paste0:
# string vector for each row
row_strings = apply(M, 1, paste0, collapse="")
# tabulate the strings
count_df = data.frame(table(row_strings))
# get indices of most frequently occuring string
which(row_strings == count_df$row_strings[which.max(count_df$Freq)])
However, these solutions are either slow or (relatively) complicated. I was wondering whether there is a more convenient, fast solution for this problem for the special case of binary matrices?
Coercing rows toString and table them.
ts <- apply(M, 1, toString)
pat <- names(sort(table(ts), decreasing=T)[1])
which(ts == pat)
# [1] 101 108 460 839 852
Data:
m <- 1e3
n <- 10
set.seed(42)
M <- matrix(sample(0:1, m*n, prob=c(1-.5, .5), replace=T), m, n)
Related
I want to generate an nxm matrix. Suppose its 100x3. I want each row to sum to 1 (so two "0"'s and one "1").
sample(c(0,0,1),3)
will give me 1 row but is there a very fast way to generate the whole matrix without an rbind?
Thank you!
No loops, no transposition. Just create a matrix of zeros and replace one entry per row with 1 by sampling the rows.
m <- matrix(0, 100, 3)
nr <- nrow(m)
m[cbind(1:nr, sample(ncol(m), nr, TRUE))] <- 1
all(rowSums(m) == 1)
# [1] TRUE
mat <- matrix(runif(300),ncol=3)
mat[] <- as.numeric(t(apply(mat, 1, function(r) r == max(r))))
t(apply(t(matrix(rep(c(0,0,1),300),nrow = 3)), 1, function(x) sample(x)))
Since you want single 1 for a row, the problem can be restated to select a column entry randomly that has 1 for each row.
So you can do like,
m <- 3; n<-100
rand_v <- floor(runif(n)*3)+1
mat <- matrix(0,n,m)
idx <- cbind(1:n,rand_v)
mat[idx] <- 1
Hope this helps.
I have a (13*122) x (14) matrix (122 stacked 13x14's), which I made into a list of 122 individual 13 x 14 matrices.
set.seed(1)
mat = matrix(rnorm(13*122*14,0,1),(13*122),14)
I have another matrix that is 122 x 14.
beta = matrix(rnorm(122*14,0,1),122,14)
I want to multiply each stacked matrix by the correspond row in beta, so the first 13 x 14 matrix would get multiplied by beta[1,] (which is 14x1), so I'd get 13x1 matrix, etc.
Should I do this with a list or is it unnecessary? I would like it to be as fast as possible.
I want to return a 13 x 122 matrix.
We could split the matrix into a 'list' of length '122' and use mapply to do the %*% of corresponding elements of 'lst' and rows of 'beta'
lst <- lapply(split(1:nrow(mat),(1:nrow(mat)-1) %/%13+1),
function(i) mat[i,])
res <- mapply(`%*%`, lst, split(beta, row(beta)))
dim(res)
#[1] 13 122
Or we could convert the matrix to array and then do the multiplication, which I guess would be fast
mat1 <- mat #if we need a copy of the original matrix
dim(mat1) <- c(13, 122, 14)
mat2 <- aperm(mat1, c(1,3,2))
res2 <- matrix(, ncol=122, nrow=13)
for(i in 1:(dim(mat2)[3])){
res2[,i] <- mat2[,,i] %*%beta[i,]
}
all.equal(res, res2, check.attributes=FALSE)
#[1] TRUE
Try this:
mat <- lapply(1:122, function(x) matrix(data = rnorm(13*14,0,1), nrow = 13, ncol = 14))
mat2 <- lapply(1:122, function(x) mat[[x]] %*% beta[x,])
look for the book introduction to algorithms and look at page 331. There is a pseodu algortihm to do so. you have to make a three of matrix products where it will sort it so that it will be an optimum for multiplication but short hand, if you have three matrices M1 of m x n, M2 of n x v, M3 of v x w then you wish to know if (M1 * M2) * M3 or M1 * (M2 * M3) is better the answer is to calculate the to numbers mnv and nvw and deside which is biggest. the smallest one is always better.
I have two vectors in R and want to generate a new matrix based on them.
a=c(1,2,1,2,3) # a[1] is 1: thus row 1, column 1 should be equal to...
b=c(10,20,30,40,50) # ...b[1], or 10.
I want to produce matrix 'v' BUT without my 'for' loop through columns of v and my multiplication:
v = as.data.frame(matrix(0,nrow=length(a),ncol=length(unique(a))))
for(i in 1:ncol(v)) v[[i]][a==i] <- 1 # looping through columns of 'v'
v <- v*b
I am sure there is a fast/elegant way to do it in R. At least of expanding 'a' into the earlier version of 'v' (before its multiplication by 'b').
Thanks a lot!
This is one way that sparse matrices can be defined.
Matrix::sparseMatrix(i = seq_along(a), j = a, x = b)
# Setup the problem:
set.seed(4242)
a <- sample(1:100, 1000000, replace = TRUE)
b <- sample(1:500, length(a), replace = TRUE)
# Start the timer
start.time <- proc.time()[3]
# Actual code
# We use a matrix instead of a data.frame
# The number of columns matches the largest column index in vector "a"
v <- matrix(0,nrow=length(a), ncol= max(a))
v[cbind(seq_along(a), a)] <- b
# Show elapsed time
stop.time <- proc.time()[3]
cat("elapsed time is: ", stop.time - start.time, "seconds.\n")
# For a million rows and a hundred columns, my prehistoric
# ... laptop says: elapsed time is: 2.597 seconds.
# these checks take much longer to run than the function itself
# Make sure the modified column in each row matches vector "a"
stopifnot(TRUE == all.equal(a, apply(v!=0, 1, which)))
# Make sure the modified value in each row equals vector "b"
stopifnot(TRUE == all.equal(rowSums(v), b))
I have a matrix temp1 (dimensions Nx16) (generally, NxM)
I would like to sum every k columns in each row to one value.
Here is what I got to so far:
cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
There must be a more elegant (and generalized) method to do it.
I have noticed similar question here:
sum specific columns among rows
couldn't make it work with Ananda's solution;
Got following error:
sapply(split.default(temp1, 0:(length(temp1)-1) %/% 4), rowSums)
Error in FUN(X[[1L]], ...) :
'x' must be an array of at least two dimensions
Please advise.
You can use by:
do.call(cbind, by(t(temp1), (seq(ncol(temp1)) - 1) %/% 4, FUN = colSums))
If the dimensions are equal for the sub matrices, you could change the dimensions to an array and then do the rowSums
m1 <- as.matrix(temp1)
n <- 4
dim(m1) <- c(nrow(m1), ncol(m1)/n, n)
res <- matrix(rowSums(apply(m1, 2, I)), ncol=n)
identical(res[,1],rowSums(temp1[,1:4]))
#[1] TRUE
Or if the dimensions are unequal
t(sapply(seq(1,ncol(temp2), by=4), function(i) {
indx <- i:(i+3)
rowSums(temp2[indx[indx <= ncol(temp2)]])}))
data
set.seed(24)
temp1 <- as.data.frame(matrix(sample(1:20, 16*4, replace=TRUE), ncol=16))
set.seed(35)
temp2 <- as.data.frame(matrix(sample(1:20, 17*4, replace=TRUE), ncol=17))
Another possibility:
x1<-sapply(1:(ncol(temp1)/4),function(x){rowSums(temp1[,1:4+(x-1)*4])})
## check
x0<-cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
identical(x1,x0)
# TRUE
Here's another approach. Convert the matrix to an array and then use apply with sum.
n <- 4
apply(array(temp1, dim=c(dim(temp1)/c(1,n), n)), MARGIN=c(1,3), FUN=sum)
Using #akrun's data
set.seed(24)
temp1 <- matrix(sample(1:20, 16*4, replace=TRUE), ncol=16)
a function which sums matrix columns with each group of size n columns
set.seed(1618)
mat <- matrix(rnorm(24 * 16), 24, 16)
f <- function(mat, n = 4) {
if (ncol(mat) %% n != 0)
stop()
cols <- split(colSums(mat), rep(1:(ncol(mat) / n), each = n))
## or use this to have n mean the number of groups you want
# cols <- split(colSums(mat), rep(1:n, each = ncol(mat) / n))
sapply(cols, sum)
}
f(mat, 4)
# 1 2 3 4
# -17.287137 -1.732936 -5.762159 -4.371258
c(sum(mat[,1:4]), sum(mat[,5:8]), sum(mat[,9:12]), sum(mat[,13:16]))
# [1] -17.287137 -1.732936 -5.762159 -4.371258
More examples:
## first 8 and last 8 cols
f(mat, 8)
# 1 2
# -19.02007 -10.13342
## each group is 16 cols, ie, the entire matrix
f(mat, 16)
# 1
# -29.15349
sum(mat)
# [1] -29.15349
Supposed that X contains 1000 rows with m columns, where m equal to 3 as follows:
set.seed(5)
X <- cbind(rnorm(1000,0,0.5), rnorm(1000,0,0.5), rnorm(1000,0,0.5))
Variable selection is performed, then the condition will be checked before performing the next operation as follows.
if(nrow(X) < 1000){print(a+b)}
,where a is 5 and b is 15, so if nrow(X) < 1000 is TRUE, then 20 will be printed out.
However, in case that X happens to be a vector because only one column is selected,
how can I check the number of data points when X can be either a matrix or vector ?
What I can think of is that
if(is.matrix(X)){
n <- nrow(X)
} else {
n <- length(X)}
if(n < 1000){print(a+b)}
Anyone has a better idea ?
Thank you
You can use NROW for both cases. From ?NROW
nrow and ncol return the number of rows or columns present in x. NCOL and NROW do the same treating a vector as 1-column matrix.
So that means that even if the subset is dropped down to a vector, as long as x is an array, vector, or data frame NROW will treat it as a one-column matrix.
sub1 <- X[,2:3]
is.matrix(sub1)
# [1] TRUE
NROW(sub1)
# [1] 1000
sub2 <- X[,1]
is.matrix(sub2)
# [1] FALSE
NROW(sub2)
# [1] 1000
So if(NROW(X) < 1000L) a + b should work regardless of whether X is a matrix or a vector. I use <= below, since X has exactly 1000 rows in your example.
a <- 5; b <- 15
if(NROW(sub1) <= 1000L) a + b
# [1] 20
if(NROW(sub2) <= 1000L) a + b
# [1] 20
A second option would be to use drop=FALSE when you make the variable selection. This will make the subset remain a matrix when the subset is only one column. This way you can use nrow with no worry. An example of this is
X[, 1, drop = FALSE]