applying rmultinom in R without iterating over matrix - r

I want to use rmultinom(), combined with a transition matrix, to generate whole number outputs that, when summed, are equal to the original values. However, I can't figure out how to do it without iterating over the matrix. Here is an example:
a = matrix(runif(16),nrow=4,ncol=4)
a = apply(a,2,FUN = function(x) x/sum(x))
b = c(5,7,5,9)
out = c(0,0,0,0) # initialize
for (i in 1:ncol(a)){
tmp = rmultinom(1,b[i],a[,i])
out = tmp + out
}
sum(out) == sum(b) ## Should eval to true
a represents a transition matrix, with each column summing to 1. b is a starting vector of integers. The loop iterates along the columns to generate a vector in out that sums to the initial numbers in b. How can I do this without using a loop? The results would be similar to if I multiply a %*% b, but this leaves me with floating point values.

You could do apply and rowSums (this will be stochastic):
library(magrittr)
set.seed(1)
a = matrix(runif(16),nrow=4,ncol=4)
a = apply(a,2,FUN = function(x) x/sum(x))
b <- c(5,7,5,9)
out <- purrr::map(1:4, ~rmultinom(1, b[.x], a[,.x])) %>%
unlist() %>%
matrix(nrow = 4) %>%
rowSums()
out
[1] 7 7 9 3
sum(out)
[1] 26
sum(b)
[1] 26

Related

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1#i, j = summary(m1)$j, x = m1#x)
position_matrix_from_m2 <- rbind(i = m2#i, j = summary(m2)$j, x = m2#x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2#x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2#x according to m1#x.
ind <- m1#x > 0 & m1#x < 0.05
m2#x[ind] <- 1
m1#x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1#x > 0 & m1#x < 0.05
m2b#x[ind] <- 1
m1b#x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.

Replace a specific row depending on input in a matrix with zeros

I want to create a function which replaces the a chosen row of a matrix with zeros. I try to think of the matrix as arbitrary but for this example I have done it with a sample 3x3 matrix with the numbers 1-9, called a_matrix
1 4 7
2 5 8
3 6 9
I have done:
zero_row <- function(M, n){
n <- c(0,0,0)
M*n
}
And then I have set the matrix and tried to get my desired result by using my zero_row function
mat1 <- a_matrix
zero_row(M = mat1, n = 1)
zero_row(M = mat1, n = 2)
zero_row(M = mat1, n = 3)
However, right now all I get is a matrix with only zeros, which I do understand why. But if I instead change the vector n to one of the following
n <- c(0,1,1)
n <- c(1,0,1)
n <- c(1,1,0)
I get my desired result for when n=1, n=2, n=3 separately. But what i want is, depending on which n I put in, I get that row to zero, so I have a function that does it for every different n, instead of me having to change the vector for every separate n. So that I get (n=2 for example)
1 4 7
0 0 0
3 6 9
And is it better to do it in another form, instead of using vectors?
Here is a way.
zero_row <- function(M, n){
stopifnot(n <= nrow(M))
M[n, ] <- 0
M
}
A <- matrix(1:9, nrow = 3)
zero_row(A, 1)
zero_row(A, 2)
zero_row(A, 3)

Create a vector with the sum of the positive elements of each column of a m*n numeric matrix in R

I need to create an R function which takes a numeric matrix A of arbitrary format n*m as input and returns a vector that is as long as A's number of columns that contains the sum of the positive elements of each column.
I must do this in 2 ways - the first in a nested loop and the second as a one liner using vector/matrix operations.
So far I have come up with the following code which creates the vector the size of the amounts of columns of matrix A but I can only seem to get it to give me the sum of all positive elements of the matrix instead of each column:
colSumPos(A){
columns <- ncol(A)
v1 <- vector("numeric", columns)
for(i in 1:columns)
{
v1[i] <- sum(A[which(A>0)])
}
}
Could someone please explain how I get the sum of each column separately and then how I can simplify the code to dispose of the nested loop?
Thanks in advance for the help!
We can use apply with MARGIN=2 to loop through the columns and get the sum of elements that are greater than 0
apply(A, 2, function(x) sum(x[x >0], na.rm = TRUE))
#[1] 1.8036685 0.7129192 0.9305136 2.6625824 0.0000000
Or another option is colSums after replacing the values less than or equal to 0 with NA
colSums(A*NA^(A<=0), na.rm = TRUE)
#[1] 1.8036685 0.7129192 0.9305136 2.6625824 0.0000000
Or by more direct approach
colSums(replace(A, A<=0, NA), na.rm = TRUE)
#[1] 1.8036685 0.7129192 0.9305136 2.6625824 0.0000000
Or if there are no NA elements (no need for na.rm=TRUE), we can replace the values that are less than or equal to 0 with 0 and make it compact (as #ikop commented)
colSums(A*(A>0))
#[1] 1.8036685 0.7129192 0.9305136 2.6625824 0.0000000
data
set.seed(24)
A <- matrix(rnorm(25), 5, 5)
You try code folow if you using for loop
sumColum <- function(A){
for(i in 1:nrow(A)){
for(j in 1:ncol(A)){
colSums(replace(A, A<=0, NA), na.rm = TRUE)
}
}
colSums(A)
}

R: Expand a vector of matrix column numbers into a matrix with those columns filled

I have two vectors in R and want to generate a new matrix based on them.
a=c(1,2,1,2,3) # a[1] is 1: thus row 1, column 1 should be equal to...
b=c(10,20,30,40,50) # ...b[1], or 10.
I want to produce matrix 'v' BUT without my 'for' loop through columns of v and my multiplication:
v = as.data.frame(matrix(0,nrow=length(a),ncol=length(unique(a))))
for(i in 1:ncol(v)) v[[i]][a==i] <- 1 # looping through columns of 'v'
v <- v*b
I am sure there is a fast/elegant way to do it in R. At least of expanding 'a' into the earlier version of 'v' (before its multiplication by 'b').
Thanks a lot!
This is one way that sparse matrices can be defined.
Matrix::sparseMatrix(i = seq_along(a), j = a, x = b)
# Setup the problem:
set.seed(4242)
a <- sample(1:100, 1000000, replace = TRUE)
b <- sample(1:500, length(a), replace = TRUE)
# Start the timer
start.time <- proc.time()[3]
# Actual code
# We use a matrix instead of a data.frame
# The number of columns matches the largest column index in vector "a"
v <- matrix(0,nrow=length(a), ncol= max(a))
v[cbind(seq_along(a), a)] <- b
# Show elapsed time
stop.time <- proc.time()[3]
cat("elapsed time is: ", stop.time - start.time, "seconds.\n")
# For a million rows and a hundred columns, my prehistoric
# ... laptop says: elapsed time is: 2.597 seconds.
# these checks take much longer to run than the function itself
# Make sure the modified column in each row matches vector "a"
stopifnot(TRUE == all.equal(a, apply(v!=0, 1, which)))
# Make sure the modified value in each row equals vector "b"
stopifnot(TRUE == all.equal(rowSums(v), b))

Converting between matrix subscripts and linear indices (like ind2sub/sub2ind in matlab)

Let's say you have a matrix
m <- matrix(1:25*2, nrow = 5, ncol=5)
How do you go from matrix subscripts (row index, column index) to a linear index you can use on the matrix. For example you can extract values of the matrix with either of these two methods
m[2,3] == 24
m[12] == 24
How do you go from (2,3) => 12 or 12 => (2,3) in R
In Matlab the functions you would use for converting matrix subscripts to linear indices and vice versa are ind2sub and `sub2ind
Is there an equivalent way in R?
This is not something I've used before, but according to this handy dandy Matlab to R cheat sheet, you might try something like this, where m is the number of rows in the matrix, r and c are row and column numbers respectively, and ind the linear index:
MATLAB:
[r,c] = ind2sub(size(A), ind)
R:
r = ((ind-1) %% m) + 1
c = floor((ind-1) / m) + 1
MATLAB:
ind = sub2ind(size(A), r, c)
R:
ind = (c-1)*m + r
For higher dimension arrays, there is the arrayInd function.
> abc <- array(dim=c(10,5,5))
> arrayInd(12,dim(abc))
dim1 dim2 dim3
[1,] 2 2 1
You mostly don't need those functions in R. In Matlab you need those because you can't do e.g.
A(i, j) = x
where i,j,x are three vectors of row and column indices and x contains the corresponding values. (see also this question)
In R you can simply:
A[ cbind(i, j) ] <- x
There are row and col functions that return those indices in matrix form. So it should be as simple as indexing the return from those two functions:
M<- matrix(1:6, 2)
row(M)[5]
#[1] 1
col(M)[5]
#[1] 3
rc.ind <- function(M, ind) c(row(M)[ind], col(M)[ind] )
rc.ind(M,5)
[1] 1 3
Late answer but there's an actual function for ind2sub in the base package called arrayInd
m <- matrix(1:25, nrow = 5, ncol=5)
# linear indices in R increase row number first, then column
arrayInd(5, dim(m))
arrayInd(6, dim(m))
# so, for any arbitrary row/column
numCol <- 3
numRow <- 4
arrayInd(numRow + ((numCol-1) * nrow(m)), dim(m))
# find the row/column of the maximum element in m
arrayInd(which.max(m), dim(m))
# actually which has an arr.ind parameter for returning array indexes
which(m==which.max(m), arr.ind = T)
For sub2ind, JD Long's answer seems to be the best
Something like this works for arbitrary dimensions-
ind2sub = function(sz,ind)
{
ind = as.matrix(ind,ncol=1);
sz = c(1,sz);
den = 1;
sub = c();
for(i in 2:length(sz)){
den = den * sz[i-1];
num = den * sz[i];
s = floor(((ind-1) %% num)/den) + 1;
sub = cbind(sub,s);
}
return(sub);
}

Resources