Generate a Matrix with Column Sum Constraint in R

For example, I can easily specify an arbitrary matrix like
x <- matrix(c(10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25),
nrow = 4, ncol = 6, byrow = TRUE)
But in an optimization problem, I am required to use a matrix x under the constraint max(colSums(x)) <= 1. How can I generate such a matrix?

OK, I will put my comment into an answer.
x <- matrix(c(10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25,
10,0.25,0.25,0.25,0.25,0.25),
nrow = 4, ncol = 6, byrow = TRUE)
alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
y <- t(t(x) / alpha)
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 10 0.25 0.25 0.25 0.25 0.25
[2,] 10 0.25 0.25 0.25 0.25 0.25
[3,] 10 0.25 0.25 0.25 0.25 0.25
[4,] 10 0.25 0.25 0.25 0.25 0.25
> y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[2,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[3,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[4,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
> alpha ## random scaling factor
[1] 41.702890 1.378154 2.684481 2.771266 2.238996 2.179589
> colSums(y)
[1] 0.9591661 0.7256085 0.3725115 0.3608459 0.4466288 0.4588021
I did not set a seed, so you will get different results when you run this.
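Why this works: every alpha[j] is at least colSums(x)[j], so as long as the column sums are positive, colSums(x)[j] / alpha[j] cannot exceed 1. Here is a minimal reproducible sketch of the same idea, assuming nonnegative entries in x. The seed value and the use of sweep() in place of the double transpose are my own choices for illustration, not something the method requires.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

x <- matrix(rep(c(10, 0.25, 0.25, 0.25, 0.25, 0.25), times = 4),
            nrow = 4, ncol = 6, byrow = TRUE)

# Divisor per column: the column sum plus a nonnegative random amount
alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))

# Divide column j by alpha[j]; this gives the same result as t(t(x) / alpha)
y <- sweep(x, 2, alpha, "/")

# Check the constraint
max(colSums(y)) <= 1
```

If x can contain negative entries, the column sums may be zero or negative and this argument no longer holds, so check the constraint explicitly in that case.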
Follow-up
Perfect, thanks a lot. Just one additional comment: with this method, can I constrain only specific columns to have sum <= 1? For example, what if I want the constraint for every column except the first?
You can manually set alpha[1] to be 1, so that essentially no scaling is done.
alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
alpha[1] <- 1
Then proceed as before. If you also want the 3rd column left unscaled, do
alpha[c(1,3)] <- 1
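The follow-up can be wrapped into a small function. This is only a sketch: scale_columns and its free argument are hypothetical names I made up, and it assumes x has nonnegative entries.

```r
# Hypothetical helper: scale the columns of x so that every column NOT listed
# in `free` sums to at most 1; columns listed in `free` are returned unchanged.
scale_columns <- function(x, free = integer(0)) {
  alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
  alpha[free] <- 1            # dividing by 1 leaves these columns as they are
  sweep(x, 2, alpha, "/")     # divide column j by alpha[j]
}

x <- matrix(rep(c(10, 0.25, 0.25, 0.25, 0.25, 0.25), times = 4),
            nrow = 4, ncol = 6, byrow = TRUE)

y <- scale_columns(x, free = c(1, 3))
colSums(y)                          # columns 1 and 3 keep their original sums
max(colSums(y)[-c(1, 3)]) <= 1      # the other columns satisfy the constraint
```

With the default free = integer(0), every column is scaled.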
random seed
Compare:
runif(4)
## [1] 0.1300864 0.5689224 0.5594464 0.8778068
runif(4)
## [1] 0.3320244 0.5838847 0.0466045 0.2594867
But if we set seed:
set.seed(0); runif(4)
## [1] 0.8966972 0.2655087 0.3721239 0.5728534
set.seed(0); runif(4)
## [1] 0.8966972 0.2655087 0.3721239 0.5728534
Without a seed, random numbers are not reproducible. I did not put set.seed(0) before alpha, so when you run the code you will see different values of alpha and y.
The seed can be any integer, not necessarily 0. For example, both -2016 and 2016 work. Different seeds give different sets of random numbers, but the same seed always gives the same numbers.
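Applied to the scaling factor, this might look as follows. make_alpha is a hypothetical helper, used only so that the seed is set immediately before the random draw.

```r
x <- matrix(rep(c(10, 0.25, 0.25, 0.25, 0.25, 0.25), times = 4),
            nrow = 4, ncol = 6, byrow = TRUE)

# Hypothetical helper: set the seed, then draw the random scaling factors
make_alpha <- function(seed) {
  set.seed(seed)
  colSums(x) + runif(ncol(x), 0, abs(mean(x)))
}

# Same seed twice: the two alpha vectors are identical
identical(make_alpha(2016), make_alpha(2016))

# Two different seeds: compare the results yourself
identical(make_alpha(2016), make_alpha(-2016))
```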

Related

How to repeat in R

I am a newbie in R. I have a vector H = (0.6, 0.045, 3) and I want to create a matrix A in which every row is this vector: 0.6, 0.045, 3. I want to choose the number of rows myself. Like this:
A (0.6,0.045,3,
0.6,0.045,3,
0.6,0.045,3,
0.6,0.045,3,
............)
You can specify the number of rows and columns in the matrix() function.
vec <- c(0.6,0.045,3)
nr <- 4
matrix(vec, nrow = nr, ncol = length(vec), byrow = TRUE)
# [,1] [,2] [,3]
#[1,] 0.6 0.045 3
#[2,] 0.6 0.045 3
#[3,] 0.6 0.045 3
#[4,] 0.6 0.045 3
Another option is to use replicate():
t(replicate(nr, vec))
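A quick check that both approaches build the same matrix. The third variant with rep(vec, each = nr) is not from the answers above; I include it as one more way that relies on column-major filling.

```r
vec <- c(0.6, 0.045, 3)
nr <- 4

a <- matrix(vec, nrow = nr, ncol = length(vec), byrow = TRUE)
b <- t(replicate(nr, vec))

# Extra variant: repeat each element nr times, then fill column by column
d <- matrix(rep(vec, each = nr), nrow = nr)

all.equal(a, b)
all.equal(a, d)
```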

calculate frequency or percentage matrix in R

if I have the following:
mm <- matrix(0, 4, 3)
mm<-apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
> mm
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 0
[3,] 0 0 0
[4,] 1 0 1
How do I output a matrix that expresses the frequency or percentage of different columns where both values = 1. For example - there are two rows out of 4 where column 1 and column 2 both equal 1 (=0.5) and 1 row out of 4 where column 2 and column 3 = 1 (=0.25), so in this case I'd need:
[,1] [,2] [,3]
[1,] 1 0.5 0.5
[2,] 0.5 1 0.25
[3,] 0.5 0.25 1
I am not interested in comparing the same columns, so by default the diagonal remains at 1.
I thought I may get somewhere with cor(mm) where there may be a way to output co-frequencies or co-percentages instead of correlation coefficients but this appears to not be the case. But the dimensions of the final output should be an N by N column matrix as cor() outputs:
> cor(mm)
[,1] [,2] [,3]
[1,] 1.0000000 0.5773503 0.5773503
[2,] 0.5773503 1.0000000 0.0000000
[3,] 0.5773503 0.0000000 1.0000000
but obviously these are correlation coefficients; I just want co-frequencies or co-percentages instead.
A base R solution is using crossprod, i.e.,
r <- `diag<-`(crossprod(mm)/nrow(mm),1)
such that
> r
[,1] [,2] [,3]
[1,] 1.0 0.50 0.50
[2,] 0.5 1.00 0.25
[3,] 0.5 0.25 1.00
DATA
mm <- structure(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1), .Dim = 4:3)
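To see why crossprod() does the job: crossprod(mm) is t(mm) %*% mm, so entry [i, j] is sum(mm[, i] * mm[, j]). For a 0/1 matrix that product is 1 only in rows where both columns are 1, so the sum counts those rows. The one-liner unpacked step by step, using the DATA above:

```r
mm <- structure(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1), .Dim = 4:3)

counts <- crossprod(mm)      # same as t(mm) %*% mm

# Entry [1, 2] equals the number of rows where columns 1 and 2 are both 1
counts[1, 2] == sum(mm[, 1] == 1 & mm[, 2] == 1)

r <- counts / nrow(mm)       # turn counts into proportions of rows
diag(r) <- 1                 # the diagonal is fixed at 1, as requested
r
```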
Alternatively, loop over every pair of columns explicitly:
set.seed(123)
mm <- matrix(0, 4, 3)
mm <- apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
# all (i, j) pairs of column indices
combinations <- expand.grid(1:ncol(mm), 1:ncol(mm))
matrix(unlist(Map(function(x, y) {
if (x == y) {
res <- 1
} else {
res <- sum(mm[, x] * mm[, y]) / nrow(mm)
}
res
}, combinations[, 1], combinations[, 2])), ncol(mm))
# [,1] [,2] [,3]
# [1,] 1.00 0.25 0.0
# [2,] 0.25 1.00 0.5
# [3,] 0.00 0.50 1.0

Efficient way to generate a coincidence matrix

I want to generate a simple coincidence matrix. I have looked for R packages but could not find one that does this calculation so far; I don't know whether the English term for this matrix differs from the Portuguese one. This is what I need to do.
I have a matrix:
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 1 2 3 1
[3,] 2 3 1 2
[4,] 1 2 3 3
A coincidence matrix will be calculated comparing each element row by row to generate a dissimilarity distance with the formula:
Diss = 1 - (Coincidences / (Coincidences + Discordance))
So my resulting matrix is a symmetric one with dimension 4x4 and diagonal elements equal to 0. In the example, A(1,2) would be:
A(1,2) = 1 - (2 / 4) = 0.5
A(1,3) = 1 - (0/4) = 1.0
And so on...
I have created a function to generate this matrix:
cs_matrix <- function (x) {
cs.mat <- matrix(rep(0,dim(x)[1]^2), ncol = dim(x)[1])
for (i in 1:dim(x)[1]){
for (j in 1:dim(x)[1]){
cs.mat[i,j] <- 1 - (sum(x[i,] == x[j,]) / dim(x)[2])
}
}
return(cs.mat)
}
The function works fine, but my actual Data Set has 2560 observations of 4 variables, thus generating a 2560 x 2560 coincidence matrix, and it takes quite some time to do the calculation. I wonder if there is a more efficient way of calculating this or even if there is already a package that can calculate this dissimilarity distance. This matrix will be later used in Cluster Analysis.
I think you can use outer
add <- function(x, y) sum(mat[x, ] == mat[y,])
nr <- seq_len(nrow(mat))
mat1 <- 1 - outer(nr, nr, Vectorize(add))/ncol(mat)
mat1
# [,1] [,2] [,3] [,4]
#[1,] 0.00 0.50 1 0.75
#[2,] 0.50 0.00 1 0.25
#[3,] 1.00 1.00 0 1.00
#[4,] 0.75 0.25 1 0.00
If diagonal elements need to be 1 do diag(mat1) <- 1.
data
mat <- structure(c(1, 1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 3, 1, 1, 2, 3), .Dim = c(4L,4L))
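One caveat: outer() with Vectorize() still calls the comparison function once per pair of rows, so for 2560 rows it may not be much faster than the double loop. A different sketch, not taken from the answer above, loops over the columns instead (only 4 of them in the question) and compares all rows at once within each column. coincidence_dist is a hypothetical name.

```r
mat <- structure(c(1, 1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 3, 1, 1, 2, 3),
                 .Dim = c(4L, 4L))

# Hypothetical helper: dissimilarity = 1 - (matching positions / number of columns)
coincidence_dist <- function(x) {
  matches <- matrix(0, nrow(x), nrow(x))
  for (k in seq_len(ncol(x))) {
    # outer(..., "==") marks, for column k, every pair of rows with equal values
    matches <- matches + outer(x[, k], x[, k], "==")
  }
  1 - matches / ncol(x)
}

d <- coincidence_dist(mat)
d
```

For cluster analysis the result can be converted with as.dist(d) before it is passed to hclust(). Whether this is actually faster on your data is something to measure, not something I have benchmarked here.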

Pasting a string matrix row-wise with a string vector

I am trying to concatenate multcompView letters with summary data into a matrix. I am using a for loop to run through the individual summary matrix cells and concatenate these with their respective letters. I am almost there but my matrix outputs both the original data and the pasted data (see below).
Function:
for (i in 1:nrow(X1))
tableRow = matrix(c(tableRow,paste(tableRow[i],letters$Letters[i],sep = "")),nrow = 1)
Where:
X1 is my summary table, tableRow is the first row of X1, and
letters contains the letter I want to concatenate with.
Returns:
[1,] "5.53 ± 0.77" "6.72 ± 1.18" "5.12 ± 0.44"
"5.24 ± 0.41" "5.53 ± 0.77a" "6.72 ± 1.18a" "5.12 ± 0.44a" "5.24 ± 0.41a"
Desired output:
[1,] "5.53 ± 0.77a" "6.72 ± 1.18a" "5.12 ± 0.44a" "5.24 ± 0.41a"
This will do all the work:
## example matrix
set.seed(0); X <- round(matrix(runif(12), nrow = 4, ncol = 3), 2)
# [,1] [,2] [,3]
# [1,] 0.90 0.91 0.66
# [2,] 0.27 0.20 0.63
# [3,] 0.37 0.90 0.06
# [4,] 0.57 0.94 0.21
matrix(paste0(X, letters[1:4]), nrow = nrow(X))
# [,1] [,2] [,3]
# [1,] "0.9a" "0.91a" "0.66a"
# [2,] "0.27b" "0.2b" "0.63b"
# [3,] "0.37c" "0.9c" "0.06c"
# [4,] "0.57d" "0.94d" "0.21d"
For your data, you can do:
matrix(paste0(X1, letters$Letters), nrow = nrow(X1))
Remark 1
My example here has a flaw. You already have X1 as a character matrix, while my example X is numeric. When doing paste0(), the numerical value 0.90 becomes "0.9" (because as.character(0.90) gives "0.9"). For your data there will be no such behaviour.
Remark 2
Oh, I actually found a way to avoid such behaviour.
X <- format(X)
# [,1] [,2] [,3]
# [1,] "0.90" "0.91" "0.66"
# [2,] "0.27" "0.20" "0.63"
# [3,] "0.37" "0.90" "0.06"
# [4,] "0.57" "0.94" "0.21"
Then paste0() keeps the trailing zeros:
matrix(paste0(X, letters[1:4]), nrow = nrow(X))
# [,1] [,2] [,3]
# [1,] "0.90a" "0.91a" "0.66a"
# [2,] "0.27b" "0.20b" "0.63b"
# [3,] "0.37c" "0.90c" "0.06c"
# [4,] "0.57d" "0.94d" "0.21d"
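Why a single paste0() call is enough: paste0() treats the matrix as one long vector in column-major order and recycles the shorter letter vector, so with one letter per row, element [i, j] gets letter i. A sketch with made-up stand-ins for the question's objects (X1 and lets below are hypothetical; I avoid calling the data frame letters because that masks the built-in letters constant):

```r
# Hypothetical character summary table with 4 rows and 2 columns
X1 <- matrix(c("5.53", "6.72", "5.12", "5.24",
               "1.10", "2.20", "3.30", "4.40"), nrow = 4)

# Hypothetical letters table: one letter per row of X1
lets <- data.frame(Letters = c("a", "a", "b", "c"), stringsAsFactors = FALSE)

out <- X1
# Assigning into out[] keeps the matrix shape (and dimnames, if there are any)
out[] <- paste0(X1, lets$Letters)
out
```

This relies on the number of letters being equal to nrow(X1); with any other length the recycling would attach letters to the wrong cells.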

How to search through sequentially numbered matrix variables in R

I have a question pertaining to R.
I have some sequentially numbered matrices (all of the same dimensions) and I want to search them all and produce a final matrix that contains (for each matrix element) the number of times a defined threshold was exceeded.
As an example, I could choose a threshold of 0.7 and I could have the following three matrices.
matrix1
[,1] [,2] [,3]
[1,] 0.38 0.72 0.15
[2,] 0.58 0.37 0.09
[3,] 0.27 0.55 0.22
matrix2
[,1] [,2] [,3]
[1,] 0.19 0.78 0.72
[2,] 0.98 0.65 0.46
[3,] 0.72 0.57 0.76
matrix3
[,1] [,2] [,3]
[1,] 0.39 0.68 0.31
[2,] 0.40 0.05 0.92
[3,] 1.00 0.43 0.21
My desired output would then be
[,1] [,2] [,3]
[1,] 0 2 1
[2,] 1 0 1
[3,] 2 0 1
If I do this:
test <- matrix1 >= 0.7
test[test==TRUE] = 1
then I get a matrix that has a 1 where the threshold is exceeded, and 0 where it's not. So this is a key step in what I want to do:
test=
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 0
[3,] 0 0 0
My thought is to make a loop so I perform this calculation on each matrix and add each result of "test" so I get the final matrix I desire. But I'm not sure about two things: how to use a counter in the variable name "matrix", and second if there's a more efficient way than using a loop.
So I'm thinking of something like this:
output = matrix(0,3,3)
for (i in 1:3) {
test <- matrixi >= 0.7
test[test==TRUE] = 1
output = output + test
}
Of course, this doesn't work because matrixi does not translate to matrix1, matrix2, etc.
I really appreciate your help!!!
If you stored your matrices in a list you would find the manipulations easier:
lst <- list(matrix(c(0.38, 0.58, 0.27, 0.72, 0.37, 0.55, 0.15, 0.09, 0.22), nrow=3),
matrix(c(0.19, 0.98, 0.72, 0.78, 0.65, 0.57, 0.72, 0.46, 0.76), nrow=3),
matrix(c(0.39, 0.40, 1.00, 0.68, 0.05, 0.43, 0.31, 0.92, 0.21), nrow=3))
Reduce("+", lapply(lst, ">=", 0.7))
# [,1] [,2] [,3]
# [1,] 0 2 1
# [2,] 1 0 1
# [3,] 2 0 1
Here, the lapply(lst, ">=", 0.7) returns a list with x >= 0.7 called for every matrix x stored in lst. Then Reduce called with + sums them all up.
If you just have three matrices, you could just do something like lst <- list(matrix1, matrix2, matrix3). However, if you have a lot more (let's say 100, numbered 1 through 100), it's probably easier to do lst <- lapply(1:100, function(x) get(paste0("matrix", x))) or lst <- mget(paste0("matrix", 1:100)).
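Putting this together for the three matrices from the question, assuming they exist as separate variables matrix1, matrix2 and matrix3. The second half shows the loop the question was aiming for, with get() building the variable name from the counter.

```r
matrix1 <- matrix(c(0.38, 0.58, 0.27, 0.72, 0.37, 0.55, 0.15, 0.09, 0.22), nrow = 3)
matrix2 <- matrix(c(0.19, 0.98, 0.72, 0.78, 0.65, 0.57, 0.72, 0.46, 0.76), nrow = 3)
matrix3 <- matrix(c(0.39, 0.40, 1.00, 0.68, 0.05, 0.43, 0.31, 0.92, 0.21), nrow = 3)

# Collect the numbered variables into a named list, then sum the indicators
lst <- mget(paste0("matrix", 1:3))
output <- Reduce("+", lapply(lst, ">=", 0.7))
output

# Loop version: get() looks up the variable whose name is built from i
output2 <- matrix(0, 3, 3)
for (i in 1:3) {
  output2 <- output2 + (get(paste0("matrix", i)) >= 0.7)
}
output2
```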
For 100 matrices, each of size 100 x 100 (based on your comment this is roughly the size of your use case), the Reduce approach with a list seems to be a bit faster than the rowSums approach with an array, though both are quick:
# Setup test data
set.seed(144)
for (i in seq(100)) {
assign(paste0("matrix", i), matrix(rnorm(10000), nrow=100))
}
all.equal(sum.josilber(), sum.gavin())
# [1] TRUE
library(microbenchmark)
microbenchmark(sum.josilber(), sum.gavin())
# Unit: milliseconds
# expr min lq median uq max neval
# sum.josilber() 6.534432 11.11292 12.47216 17.13995 160.1497 100
# sum.gavin() 11.421577 16.54199 18.62949 23.09079 165.6413 100
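The definitions of sum.josilber() and sum.gavin() are not shown above. The sketch below is my own guess at what such a comparison could look like, not the original benchmark code. sum_list and sum_array are hypothetical names, and the data are smaller than in the benchmark so that it runs instantly.

```r
set.seed(144)

# 20 random matrices of size 50 x 50, kept in a list
lst <- replicate(20, matrix(rnorm(2500), nrow = 50), simplify = FALSE)

# Hypothetical list-based version: add up the logical indicator matrices
sum_list <- function(lst, thr = 0.7) {
  Reduce("+", lapply(lst, ">=", thr))
}

# Hypothetical array-based version: stack the matrices into a 3-d array,
# then sum over the third dimension
sum_array <- function(lst, thr = 0.7) {
  arr <- array(unlist(lst), dim = c(dim(lst[[1]]), length(lst)))
  rowSums(arr >= thr, dims = 2)
}

all.equal(sum_list(lst), sum_array(lst))
```

Timings depend on the machine and on the matrix sizes, so run microbenchmark() on your own data to see which one is faster there.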
If you put the matrices in an array, this is easy to do without a loop. Here's an example:
## dummy data
set.seed(1)
m1 <- matrix(runif(9), ncol = 3)
m2 <- matrix(runif(9), ncol = 3)
m3 <- matrix(runif(9), ncol = 3)
Stick these into an array
arr <- array(c(m1, m2, m3), dim = c(3,3,3))
Now each matrix is like a plate and the array is a stack of these plates.
Do as you did and convert the array into an indicator array (you don't need to save this step, it could be done inline in the next call)
ind <- arr > 0.7
This gives:
> ind
, , 1
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE FALSE
, , 2
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE TRUE
, , 3
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
Now use the rowSums() function to compute the values you want
> rowSums(ind, dims = 2)
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 2 1
Note that with dims = 2, rowSums() treats the first two dimensions as the "rows" and sums over the remaining ones, here the third (somewhat confusingly, the summing starts at dimension dims + 1). In this case, we are summing the values down through the stack of plates (the array) for each cell of the 3 x 3 grid, hence the 9 values in the output.
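If the dims argument is hard to picture, it can help to compare it with two more explicit ways of summing through the stack. A small sketch on a 3 x 3 x 3 array (the seed and data are only for illustration):

```r
set.seed(1)
arr <- array(runif(27), dim = c(3, 3, 3))
ind <- arr > 0.7

# dims = 2: keep dimensions 1 and 2, sum over dimension 3
a <- rowSums(ind, dims = 2)

# The same thing spelled out with apply(): keep margins 1 and 2, sum the rest
b <- apply(ind, c(1, 2), sum)

# And by hand: add the three "plates"
d <- ind[, , 1] + ind[, , 2] + ind[, , 3]

all.equal(a, b)
all.equal(a, d)
```

rowSums() is usually the faster of these on large arrays, but the apply() form makes it obvious which dimensions are kept.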
If you need to get your objects into the array form, you can do this via
arr2 <- do.call("cbind", mget(c("m1","m2","m3")))
dim(arr2) <- c(3,3,3) # c(nrow(m1), ncol(m1), nmat)
> all.equal(arr, arr2)
[1] TRUE
For larger problems (more matrices) use something like
nmat <- 200 ## number of matrices
matrices <- paste0("m", seq_len(nmat))
arr <- do.call("cbind", mget(matrices))
dim(arr) <- c(dim(m1), nmat)
