Pasting a string matrix row-wise with a string vector in R

I am trying to concatenate multcompView letters with summary data into a matrix. I am using a for loop to run through the individual summary matrix cells and concatenate these with their respective letters. I am almost there, but my output matrix contains both the original data and the pasted data (see below).
Function:
for (i in 1:nrow(X1))
  tableRow = matrix(c(tableRow, paste(tableRow[i], letters$Letters[i], sep = "")), nrow = 1)
Where:
X1 is my summary table, tableRow is the first row of X1, and
letters contains the letter I want to concatenate with.
Returns:
[1,] "5.53 ± 0.77" "6.72 ± 1.18" "5.12 ± 0.44"
"5.24 ± 0.41" "5.53 ± 0.77a" "6.72 ± 1.18a" "5.12 ± 0.44a" "5.24 ± 0.41a"
Desired output:
[1,] "5.53 ± 0.77a" "6.72 ± 1.18a" "5.12 ± 0.44a" "5.24 ± 0.41a"

This will do all the work:
## example matrix
set.seed(0); X <- round(matrix(runif(12), nrow = 4, ncol = 3), 2)
# [,1] [,2] [,3]
# [1,] 0.90 0.91 0.66
# [2,] 0.27 0.20 0.63
# [3,] 0.37 0.90 0.06
# [4,] 0.57 0.94 0.21
matrix(paste0(X, letters[1:4]), nrow = nrow(X))
# [,1] [,2] [,3]
# [1,] "0.9a" "0.91a" "0.66a"
# [2,] "0.27b" "0.2b" "0.63b"
# [3,] "0.37c" "0.9c" "0.06c"
# [4,] "0.57d" "0.94d" "0.21d"
For your data, you can do:
matrix(paste0(X1, letters$Letters), nrow = nrow(X1))
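As a side note (an equivalent idiom, not part of the original answer): assigning into X1[] preserves the dim and dimnames attributes, so you don't need to rebuild the matrix at all:
X1[] <- paste0(X1, letters$Letters)  ## "[<-" keeps the matrix shape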
Remark 1
My example has a flaw: you already have X1 as a character matrix, while my example X is numeric. When paste0() coerces, the numeric value 0.90 becomes "0.9" (because as.character(0.90) gives "0.9"). Your data will not show this behaviour, since X1 is already character.
Remark 2
There is actually a way to avoid this behaviour: convert the matrix to character with format() first.
X <- format(X)
# [,1] [,2] [,3]
# [1,] "0.90" "0.91" "0.66"
# [2,] "0.27" "0.20" "0.63"
# [3,] "0.37" "0.90" "0.06"
# [4,] "0.57" "0.94" "0.21"
Then paste0() works as expected:
matrix(paste0(X, letters[1:4]), nrow = nrow(X))
# [,1] [,2] [,3]
# [1,] "0.90a" "0.91a" "0.66a"
# [2,] "0.27b" "0.20b" "0.63b"
# [3,] "0.37c" "0.90c" "0.06c"
# [4,] "0.57d" "0.94d" "0.21d"

Related

I want to create hypothetical data based on a correlation matrix

I am trying to write a function that creates hypothetical data. The function takes a correlation matrix (R) and an integer (n) as input and should create a data frame (or matrix) with n rows whose bivariate correlations among the columns match the correlation matrix (R).
Here is what I have thus far:
My function is:
generate_data = function(R, n) {
  r.eigen = eigen(R)
  factors = t(t(r.eigen$vectors) * sqrt(r.eigen$values))
  data = matrix(rnorm(n * ncol(R)), n)
  data = data %*% t(factors)
  return(data)
}
And here are the outputs that I get with different n's.
Example correlation matrix:
R = matrix(c(1, .06, -.1, .1, .06, 1, -.51, .14, -.1, -.51, 1, .12, .1, .14, .12, 1), ncol = 4)
> R
[,1] [,2] [,3] [,4]
[1,] 1.00 0.06 -0.10 0.10
[2,] 0.06 1.00 -0.51 0.14
[3,] -0.10 -0.51 1.00 0.12
[4,] 0.10 0.14 0.12 1.00
And here are the correlation matrices that I can get based on the n (number of rows).
> round(cor(generate_data(R, 100)), 2)
[,1] [,2] [,3] [,4]
[1,] 1.00 -0.23 0.09 0.12
[2,] -0.23 1.00 -0.44 0.23
[3,] 0.09 -0.44 1.00 0.09
[4,] 0.12 0.23 0.09 1.00
> round(cor(generate_data(R, 1000)), 2)
[,1] [,2] [,3] [,4]
[1,] 1.00 0.05 -0.11 0.10
[2,] 0.05 1.00 -0.51 0.13
[3,] -0.11 -0.51 1.00 0.17
[4,] 0.10 0.13 0.17 1.00
> round(cor(generate_data(R, 10000)), 2)
[,1] [,2] [,3] [,4]
[1,] 1.00 0.05 -0.09 0.10
[2,] 0.05 1.00 -0.50 0.13
[3,] -0.09 -0.50 1.00 0.14
[4,] 0.10 0.13 0.14 1.00
My function seems to work reasonably well for large n's, but it does not work for small n's. Is there a way to make a function that works for smaller n's as well?
Hope this is clear enough. I appreciate all help.
You can use the rmvnorm() function from the mvtnorm package. However, for small n the sample correlations are unlikely to have stabilized, so you will see the same small-sample variation that you saw with your own function. There is nothing you can do about that; that is randomness.
library("mvtnorm")
R <- matrix(c(1, .06, -.1, .1, .06, 1, -.51, .14, -.1, -.51, 1, .12, .1, .14, .12, 1), ncol = 4)
x <- rmvnorm(n=500, mean=c(0,0,0,0), sigma=R)
This produces
cor(x)
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.1023989 -0.10946186 0.12230412
[2,] 0.1023989 1.0000000 -0.53853097 0.15985618
[3,] -0.1094619 -0.5385310 1.00000000 0.05587178
[4,] 0.1223041 0.1598562 0.05587178 1.00000000
which seems pretty close to what you specified with R. If I increase the number of samples to 10000 I get
x <- rmvnorm(n=10000, mean=c(0,0,0,0), sigma=R)
cor(x)
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.05969971 -0.08121426 0.1121826
[2,] 0.05969971 1.00000000 -0.51305601 0.1247779
[3,] -0.08121426 -0.51305601 1.00000000 0.1340828
[4,] 0.11218257 0.12477793 0.13408277 1.0000000
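A side note beyond this answer: if what you actually need is for the sample correlation matrix to equal R exactly, even for small n, mvrnorm() from the MASS package can force the empirical statistics onto the generated sample via its empirical argument. A minimal sketch, assuming MASS is installed:
library("MASS")
x <- mvrnorm(n = 20, mu = rep(0, 4), Sigma = R, empirical = TRUE)
round(cor(x), 2)  ## reproduces R exactly, even for small n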

Calculate a frequency or percentage matrix in R

If I have the following:
mm <- matrix(0, 4, 3)
mm<-apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
> mm
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 0
[3,] 0 0 0
[4,] 1 0 1
How do I output a matrix that expresses, for each pair of columns, the fraction of rows where both values equal 1? For example, there are two rows out of 4 where columns 1 and 2 both equal 1 (= 0.5), and one row out of 4 where columns 2 and 3 both equal 1 (= 0.25), so in this case I'd need:
[,1] [,2] [,3]
[1,] 1 0.5 0.5
[2,] 0.5 1 0.25
[3,] 0.5 0.25 1
I am not interested in comparing the same columns, so by default the diagonal remains at 1.
I thought I might get somewhere with cor(mm) if there were a way to output co-frequencies or co-percentages instead of correlation coefficients, but that appears not to be the case. Still, the final output should be an N by N matrix with the same dimensions as cor() produces:
> cor(mm)
[,1] [,2] [,3]
[1,] 1.0000000 0.5773503 0.5773503
[2,] 0.5773503 1.0000000 0.0000000
[3,] 0.5773503 0.0000000 1.0000000
but obviously these are correlation coefficients; I just want co-frequencies or co-percentages instead.
A base R solution uses crossprod(). Since mm contains only 0s and 1s, crossprod(mm) (i.e., t(mm) %*% mm) counts, for each pair of columns, the rows where both equal 1; dividing by nrow(mm) converts the counts to proportions, and `diag<-` resets the diagonal to 1:
r <- `diag<-`(crossprod(mm) / nrow(mm), 1)
such that
> r
[,1] [,2] [,3]
[1,] 1.0 0.50 0.50
[2,] 0.5 1.00 0.25
[3,] 0.5 0.25 1.00
DATA
mm <- structure(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1), .Dim = 4:3)
An alternative is to compute each pairwise proportion explicitly with Map() over all column pairs (note that this uses a different random matrix, drawn after set.seed(123)):
set.seed(123)
mm <- matrix(0, 4, 3)
mm <- apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
combinations <- expand.grid(1:ncol(mm), 1:ncol(mm))
matrix(unlist(Map(function(x, y) {
  if (x == y) {
    res <- 1
  } else {
    res <- sum(mm[, x] * mm[, y]) / nrow(mm)
  }
  res
}, combinations[, 1], combinations[, 2])), 3)
# [,1] [,2] [,3]
# [1,] 1.00 0.25 0.0
# [2,] 0.25 1.00 0.5
# [3,] 0.00 0.50 1.0

Generate a Matrix with Column Sum Constraint in R

For example, I can easily specify an arbitrary matrix like
x <- matrix(c(10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25),
            nrow = 4, ncol = 6, byrow = TRUE)
But in an optimization problem, I am required to use a matrix x under the constraint max(colSums(x)) <= 1. How can I generate such a matrix?
OK, I will put my comment into an answer.
x <- matrix(c(10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25,
              10, 0.25, 0.25, 0.25, 0.25, 0.25),
            nrow = 4, ncol = 6, byrow = TRUE)
alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
y <- t(t(x) / alpha)
> x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 10 0.25 0.25 0.25 0.25 0.25
[2,] 10 0.25 0.25 0.25 0.25 0.25
[3,] 10 0.25 0.25 0.25 0.25 0.25
[4,] 10 0.25 0.25 0.25 0.25 0.25
> y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[2,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[3,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
[4,] 0.2397915 0.1814021 0.09312788 0.09021147 0.1116572 0.1147005
> alpha ## random scaling factor
[1] 41.702890 1.378154 2.684481 2.771266 2.238996 2.179589
> colSums(y)
[1] 0.9591661 0.7256085 0.3725115 0.3608459 0.4466288 0.4588021
I did not set a seed, so when you run this, your result will be different.
Follow-up
Perfect, thanks a lot. Just one additional question: with this method, can I constrain only specific column sums to be <= 1? I.e., what if I don't want the constraint on the first column, but only on the others?
You can manually set alpha[1] to 1, so that no scaling is applied to that column.
alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
alpha[1] <- 1
Then proceed as before. If you want the 3rd column left unaffected as well, do
alpha[c(1,3)] <- 1
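To collect these steps in one place, here is a minimal sketch of a helper function (the name scale_cols and its skip argument are hypothetical, my own wrapper around the approach above):
scale_cols <- function(x, skip = integer(0)) {
  ## random scaling factors strictly larger than each column sum
  alpha <- colSums(x) + runif(ncol(x), 0, abs(mean(x)))
  alpha[skip] <- 1  ## leave these columns unscaled
  t(t(x) / alpha)
}
y <- scale_cols(x, skip = c(1, 3))
colSums(y)  ## all columns except 1 and 3 now sum to less than 1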
Random seed
Compare:
runif(4)
## [1] 0.1300864 0.5689224 0.5594464 0.8778068
runif(4)
## [1] 0.3320244 0.5838847 0.0466045 0.2594867
But if we set seed:
set.seed(0); runif(4)
## [1] 0.8966972 0.2655087 0.3721239 0.5728534
set.seed(0); runif(4)
## [1] 0.8966972 0.2655087 0.3721239 0.5728534
Without setting a seed, random numbers are not reproducible. I did not put set.seed(0) before computing alpha, so when you run the code you will see a different alpha and y.
The seed can be any integer, not necessarily 0. For example, both -2016 and 2016 work. Different seeds give different sets of random numbers, but whenever you use the same seed, the random numbers are the same.

How to search through sequentially numbered matrix variables in R

I have a question pertaining to R.
I have some sequentially numbered matrices (all of the same dimensions) and I want to search them all and produce a final matrix that contains (for each matrix element) the number of times a defined threshold was exceeded.
As an example, I could choose a threshold of 0.7 and I could have the following three matrices.
matrix1
[,1] [,2] [,3]
[1,] 0.38 0.72 0.15
[2,] 0.58 0.37 0.09
[3,] 0.27 0.55 0.22
matrix2
[,1] [,2] [,3]
[1,] 0.19 0.78 0.72
[2,] 0.98 0.65 0.46
[3,] 0.72 0.57 0.76
matrix3
[,1] [,2] [,3]
[1,] 0.39 0.68 0.31
[2,] 0.40 0.05 0.92
[3,] 1.00 0.43 0.21
My desired output would then be
[,1] [,2] [,3]
[1,] 0 2 1
[2,] 1 0 1
[3,] 2 0 1
If I do this:
test <- matrix1 >= 0.7
test[test==TRUE] = 1
then I get a matrix that has a 1 where the threshold is exceeded, and 0 where it's not. So this is a key step in what I want to do:
test=
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 0
[3,] 0 0 0
My thought is to make a loop that performs this calculation on each matrix and adds up the results of "test" to give the final matrix I desire. But I'm not sure about two things: first, how to use a counter in the variable name "matrix", and second, whether there's a more efficient way than a loop.
So I'm thinking of something like this:
output = matrix(0, 3, 3)
for (i in 1:3) {
  test <- matrixi >= 0.7
  test[test == TRUE] = 1
  output = output + test
}
Of course, this doesn't work because matrixi does not translate to matrix1, matrix2, etc.
I really appreciate your help!!!
If you stored your matrices in a list you would find the manipulations easier:
lst <- list(matrix(c(0.38, 0.58, 0.27, 0.72, 0.37, 0.55, 0.15, 0.09, 0.22), nrow=3),
            matrix(c(0.19, 0.98, 0.72, 0.78, 0.65, 0.57, 0.72, 0.46, 0.76), nrow=3),
            matrix(c(0.39, 0.40, 1.00, 0.68, 0.05, 0.43, 0.31, 0.92, 0.21), nrow=3))
Reduce("+", lapply(lst, ">=", 0.7))
# [,1] [,2] [,3]
# [1,] 0 2 1
# [2,] 1 0 1
# [3,] 2 0 1
Here, the lapply(lst, ">=", 0.7) returns a list with x >= 0.7 called for every matrix x stored in lst. Then Reduce called with + sums them all up.
If you just have three matrices, you could just do something like lst <- list(matrix1, matrix2, matrix3). However, if you have a lot more (let's say 100, numbered 1 through 100), it's probably easier to do lst <- lapply(1:100, function(x) get(paste0("matrix", x))) or lst <- mget(paste0("matrix", 1:100)).
For 100 matrices, each of size 100 x 100 (based on your comment this is roughly the size of your use case), the Reduce approach with a list (sum.josilber, sketched below) seems to be a bit faster than the rowSums approach with an array (sum.gavin), though both are quick:
# Setup test data
set.seed(144)
for (i in seq(100)) {
assign(paste0("matrix", i), matrix(rnorm(10000), nrow=100))
}
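The two wrappers sum.josilber() and sum.gavin() are not defined in this excerpt; reconstructed from the two answers, they would look roughly like this (a sketch, not the benchmark's exact code):
lst <- mget(paste0("matrix", 1:100))
sum.josilber <- function() Reduce("+", lapply(lst, ">=", 0.7))
sum.gavin <- function() {
  arr <- array(unlist(lst), dim = c(100, 100, 100))  ## stack the matrices as plates
  rowSums(arr >= 0.7, dims = 2)
}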
all.equal(sum.josilber(), sum.gavin())
# [1] TRUE
library(microbenchmark)
microbenchmark(sum.josilber(), sum.gavin())
# Unit: milliseconds
# expr min lq median uq max neval
# sum.josilber() 6.534432 11.11292 12.47216 17.13995 160.1497 100
# sum.gavin() 11.421577 16.54199 18.62949 23.09079 165.6413 100
If you put the matrices in an array, this is easy to do without a loop. Here's an example:
## dummy data
set.seed(1)
m1 <- matrix(runif(9), ncol = 3)
m2 <- matrix(runif(9), ncol = 3)
m3 <- matrix(runif(9), ncol = 3)
Stick these into an array
arr <- array(c(m1, m2, m3), dim = c(3,3,3))
Now each matrix is like a plate and the array is a stack of these plates.
Do as you did and convert the array into an indicator array (you don't need to save this step, it could be done inline in the next call)
ind <- arr > 0.7
This gives:
> ind
, , 1
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE FALSE
, , 2
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE TRUE
, , 3
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
Now use the rowSums() function to compute the values you want
> rowSums(ind, dims = 2)
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 2 1
Note that what rowSums() sums over is, somewhat confusingly, the dimensions above dims. In this case, with dims = 2, we are summing the values down through the stack of plates (the array) for each of the 3*3 cells, hence the 9 values in the output.
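Equivalently, though typically slower, apply() can do the same summation across the third dimension for each cell:
apply(ind, c(1, 2), sum)  ## same result as rowSums(ind, dims = 2)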
If you need to get your objects into the array form, you can do this via
arr2 <- do.call("cbind", mget(c("m1","m2","m3")))
dim(arr2) <- c(3,3,3) # c(nrow(m1), ncol(m1), nmat)
> all.equal(arr, arr2)
[1] TRUE
For larger problems (more matrices) use something like
nmat <- 200 ## number matrices
matrices <- paste0("m", seq_len(nmat))
arr <- do.call("cbind", mget(matrices))
dim(arr) <- c(dim(m1), nmat)

How do I find peak values/row numbers?

I have a large dataset (202k points). I know that there are 8 values over 0.5. I want to subset on those rows.
How do I find/return a list the row numbers where the values are > 0.5?
If the dataset is a vector named x:
(1:length(x))[x > 0.5]
If the dataset is a data.frame or matrix named x and the variable of interest is in column j:
(1:nrow(x))[x[,j] > 0.5]
But if you just want to find the subset and don't really need the row numbers, use
subset(x, x > 0.5)
for a vector and
subset(x, x[,j] > 0.5)
for a matrix or data.frame.
Alternatively, which() returns the matching indices directly:
which(x > 0.5)
Here's some dummy data:
D<-matrix(c(0.6,0.1,0.1,0.2,0.1,0.1,0.23,0.1,0.8,0.2,0.2,0.2),nrow=3)
Which looks like:
> D
[,1] [,2] [,3] [,4]
[1,] 0.6 0.2 0.23 0.2
[2,] 0.1 0.1 0.10 0.2
[3,] 0.1 0.1 0.80 0.2
And here's the logical row index,
index <- (rowSums(D>0.5))>=1
You can use it to extract the rows you want:
PeakRows <- D[index,]
Which looks like this:
> PeakRows
[,1] [,2] [,3] [,4]
[1,] 0.6 0.2 0.23 0.2
[2,] 0.1 0.1 0.80 0.2
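Since the question asks for the row numbers as well, which() on the logical index returns them directly:
which(index)
# [1] 1 3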
Using the argument arr.ind=TRUE with which() is a great way to find the row (or column) numbers where a condition is TRUE:
df <- matrix(c(0.6,0.2,0.1,0.25,0.11,0.13,0.23,0.18,0.21,0.29,0.23,0.51), nrow=4)
# [,1] [,2] [,3]
# [1,] 0.60 0.11 0.21
# [2,] 0.20 0.13 0.29
# [3,] 0.10 0.23 0.23
# [4,] 0.25 0.18 0.51
which with arr.ind=TRUE returns the array indices where the condition is TRUE
which(df > 0.5, arr.ind=TRUE)
row col
[1,] 1 1
[2,] 4 3
so the subset of rows containing a value above 0.5 becomes
df[which(df > 0.5, arr.ind=TRUE)[, "row"], ]
# [,1] [,2] [,3]
# [1,] 0.60 0.11 0.21
# [2,] 0.25 0.18 0.51
