How to search through sequentially numbered matrix variables in R

I have a question pertaining to R.
I have some sequentially numbered matrices (all of the same dimensions) and I want to search them all and produce a final matrix that contains (for each matrix element) the number of times a defined threshold was exceeded.
As an example, I could choose a threshold of 0.7 and I could have the following three matrices.
matrix1
[,1] [,2] [,3]
[1,] 0.38 0.72 0.15
[2,] 0.58 0.37 0.09
[3,] 0.27 0.55 0.22
matrix2
[,1] [,2] [,3]
[1,] 0.19 0.78 0.72
[2,] 0.98 0.65 0.46
[3,] 0.72 0.57 0.76
matrix3
[,1] [,2] [,3]
[1,] 0.39 0.68 0.31
[2,] 0.40 0.05 0.92
[3,] 1.00 0.43 0.21
My desired output would then be
[,1] [,2] [,3]
[1,] 0 2 1
[2,] 1 0 1
[3,] 2 0 1
If I do this:
test <- matrix1 >= 0.7
test[test==TRUE] = 1
then I get a matrix that has a 1 where the threshold is exceeded, and 0 where it's not. So this is a key step in what I want to do:
test=
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 0 0
[3,] 0 0 0
My thought is to make a loop so I perform this calculation on each matrix and add each result of "test" so I get the final matrix I desire. But I'm not sure about two things: how to use a counter in the variable name "matrix", and second if there's a more efficient way than using a loop.
So I'm thinking of something like this:
output = matrix(0,3,3)
for (i in 1:3) {
test <- matrixi >= 0.7
test[test==TRUE] = 1
output = output + test }
Of course, this doesn't work because matrixi does not translate to matrix1, matrix2, etc.
I really appreciate your help!!!

If you stored your matrices in a list you would find the manipulations easier:
lst <- list(matrix(c(0.38, 0.58, 0.27, 0.72, 0.37, 0.55, 0.15, 0.09, 0.22), nrow=3),
matrix(c(0.19, 0.98, 0.72, 0.78, 0.65, 0.57, 0.72, 0.46, 0.76), nrow=3),
matrix(c(0.39, 0.40, 1.00, 0.68, 0.05, 0.43, 0.31, 0.92, 0.21), nrow=3))
Reduce("+", lapply(lst, ">=", 0.7))
# [,1] [,2] [,3]
# [1,] 0 2 1
# [2,] 1 0 1
# [3,] 2 0 1
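Here, lapply(lst, ">=", 0.7) returns a list holding the logical matrix x >= 0.7 for every matrix x stored in lst. Reduce() called with "+" then adds them up elementwise; TRUE counts as 1 and FALSE as 0, so each cell ends up holding the number of matrices that met the threshold.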
If you just have three matrices, you could just do something like lst <- list(matrix1, matrix2, matrix3). However, if you have a lot more (let's say 100, numbered 1 through 100), it's probably easier to do lst <- lapply(1:100, function(x) get(paste0("matrix", x))) or lst <- mget(paste0("matrix", 1:100)).
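To tie this back to the loop in the question, here is a minimal sketch using the three example matrices. It shows both a loop that builds each variable name with paste0() and looks it up with get(), and the list version built with mget(). The names threshold and n_mat are assumptions introduced for illustration.

```r
# The three example matrices from the question (filled column by column).
matrix1 <- matrix(c(0.38, 0.58, 0.27, 0.72, 0.37, 0.55, 0.15, 0.09, 0.22), nrow = 3)
matrix2 <- matrix(c(0.19, 0.98, 0.72, 0.78, 0.65, 0.57, 0.72, 0.46, 0.76), nrow = 3)
matrix3 <- matrix(c(0.39, 0.40, 1.00, 0.68, 0.05, 0.43, 0.31, 0.92, 0.21), nrow = 3)

threshold <- 0.7  # assumed name for the cutoff
n_mat <- 3        # assumed name for the number of matrices

# Loop version: get() looks a variable up by the name built with paste0().
output <- matrix(0, 3, 3)
for (i in seq_len(n_mat)) {
  output <- output + (get(paste0("matrix", i)) >= threshold)
}

# List version: collect the matrices once, then add up the logical matrices.
lst <- mget(paste0("matrix", seq_len(n_mat)), envir = environment())
output2 <- Reduce("+", lapply(lst, ">=", threshold))
```

Both output and output2 should reproduce the desired matrix from the question. The list version avoids repeated name lookups and makes the set of matrices easy to pass to other functions.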
For 100 matrices, each of size 100 x 100 (based on your comment this is roughly the size of your use case), the Reduce approach with a list seems to be a bit faster than the rowSums approach with an array, though both are quick:
# Setup test data
set.seed(144)
for (i in seq(100)) {
assign(paste0("matrix", i), matrix(rnorm(10000), nrow=100))
}
all.equal(sum.josilber(), sum.gavin())
# [1] TRUE
library(microbenchmark)
microbenchmark(sum.josilber(), sum.gavin())
# Unit: milliseconds
# expr min lq median uq max neval
# sum.josilber() 6.534432 11.11292 12.47216 17.13995 160.1497 100
# sum.gavin() 11.421577 16.54199 18.62949 23.09079 165.6413 100
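The benchmark above calls sum.josilber() and sum.gavin(), whose definitions are not shown here. The following is a hypothetical reconstruction, assuming each function simply wraps one of the two approaches (list plus Reduce, array plus rowSums) with a threshold of 0.7. The arguments n, threshold and env are assumptions added so the functions can find the matrices wherever they were created.

```r
# Hypothetical reconstruction: these definitions are not part of the original text.
# Test data as in the setup above: 100 matrices of size 100 x 100.
set.seed(144)
for (i in seq(100)) {
  assign(paste0("matrix", i), matrix(rnorm(10000), nrow = 100))
}

# List approach: fetch the matrices by name, threshold each one, add them up.
sum.josilber <- function(n = 100, threshold = 0.7, env = parent.frame()) {
  lst <- mget(paste0("matrix", seq_len(n)), envir = env)
  Reduce("+", lapply(lst, ">=", threshold))
}

# Array approach: bind the matrices, reshape to 3 dimensions, sum over the third.
sum.gavin <- function(n = 100, threshold = 0.7, env = parent.frame()) {
  mats <- mget(paste0("matrix", seq_len(n)), envir = env)
  arr <- do.call("cbind", mats)
  dim(arr) <- c(dim(mats[[1]]), n)
  rowSums(arr >= threshold, dims = 2)
}

same <- all.equal(sum.josilber(), sum.gavin())
```

Timings will differ by machine and R version, so the numbers above are best read as a rough comparison only.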

If you put the matrices in an array, this is easy to do without a loop. Here's an example:
## dummy data
set.seed(1)
m1 <- matrix(runif(9), ncol = 3)
m2 <- matrix(runif(9), ncol = 3)
m3 <- matrix(runif(9), ncol = 3)
Stick these into an array
arr <- array(c(m1, m2, m3), dim = c(3,3,3))
Now each matrix is like a plate and the array is a stack of these plates.
Do as you did and convert the array into an indicator array (you don't need to save this step, it could be done inline in the next call)
ind <- arr > 0.7
This gives:
> ind
, , 1
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE FALSE
, , 2
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE TRUE
, , 3
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
Now use the rowSums() function to compute the values you want
> rowSums(ind, dims = 2)
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 1 0 1
[3,] 1 2 1
Note that rowSums() sums over dimension dims + 1 and any beyond it, which is somewhat confusing. With dims = 2 the first two dimensions are kept, so we are summing down through the stack of plates (the third dimension of the array) for each cell of the 3 x 3 grid, hence the 9 values in the output.
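As a small illustration of how the dims argument behaves, the sketch below compares rowSums() with the equivalent apply() call, which names the kept dimensions directly. It assumes the same seed as above, so one runif(27) call yields the same values as m1, m2 and m3 stacked; the names counts, counts_apply and per_row are introduced here for illustration.

```r
set.seed(1)
arr <- array(runif(27), dim = c(3, 3, 3))  # same values as m1, m2, m3 stacked
ind <- arr > 0.7

# dims = 2: keep dimensions 1 and 2, sum over dimension 3 (the stack).
counts <- rowSums(ind, dims = 2)

# The same result written with apply(), which names the kept dimensions.
counts_apply <- apply(ind, c(1, 2), sum)

# dims = 1 (the default): keep dimension 1, sum over dimensions 2 and 3.
per_row <- rowSums(ind)
```

The apply() form is easier to read, while rowSums() is usually the faster choice on large arrays.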
If you need to get your objects into the array form, you can do this via
arr2 <- do.call("cbind", mget(c("m1","m2","m3")))
dim(arr2) <- c(3,3,3) # c(nrow(m1), ncol(m1), nmat)
> all.equal(arr, arr2)
[1] TRUE
For larger problems (more matrices) use something like
nmat <- 200 ## number matrices
matrices <- paste0("m", seq_len(nmat))
arr <- do.call("cbind", mget(matrices))
dim(arr) <- c(dim(m1), nmat)

Related

Forloop using variable name to fill array in R

Within a function in R, I need to fill in a specific array. It works when I write out all the lines one by one by hand, but I was wondering whether it is possible to use a for loop, as the real array will be much bigger than the example below.
A simplified example of what I try to do:
dt <- data.frame(prob_name = c("q_1", "q_2", "p_1", "p_2", "p_3"),
prob=c(100,200,0.07, 0.08, 0.09))
dt <- setNames(data.frame(t(dt[,-1])), dt[,1])
trans_mat <- array(0, dim = c(2, 2, 3))
for (i in 1:nrow(dt)) {
trans_mat[1, 2, i] <- p_i
}
I want those specific places in the array to be filled with the corresponding probability, so the array will be
1) 0, 0.07
0, 0
2) 0, 0.08
0, 0
etc
Is there a way to do this with a forloop (as the forloop is not recognizing the "i" in "p_i"), or do I have to write this all out like
trans_mat[1,2,1] <- p_1
Thanks in advance!
Loop over the third dimension of trans_mat rather than over nrow(dt), because dt has only one row. Inside the loop, build the column name with paste0("p_", i), extract that column with [[, and assign it:
for(i in seq(dim(trans_mat)[3])) trans_mat[1, 2, i] <- dt[[paste0("p_", i)]]
Output:
> trans_mat
, , 1
[,1] [,2]
[1,] 0 0.07
[2,] 0 0.00
, , 2
[,1] [,2]
[1,] 0 0.08
[2,] 0 0.00
, , 3
[,1] [,2]
[1,] 0 0.09
[2,] 0 0.00
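The loop can also be replaced by a single vectorised assignment. This is a minimal sketch rather than part of the original answer, assuming the columns are named p_1, p_2, ... in the same order as the slices of the array; p_cols is a name introduced here.

```r
dt <- data.frame(q_1 = 100, q_2 = 200, p_1 = 0.07, p_2 = 0.08, p_3 = 0.09)
trans_mat <- array(0, dim = c(2, 2, 3))

# Build all column names at once: "p_1", "p_2", "p_3".
p_cols <- paste0("p_", seq_len(dim(trans_mat)[3]))

# Leaving the third index empty assigns position [1, 2] in every slice.
trans_mat[1, 2, ] <- unlist(dt[1, p_cols])
```

The right-hand side must have one value per slice, in slice order, for the assignment to line up.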
Using replace() inside sapply(). Linear index 3 of a 2 x 2 matrix is position [1, 2]:
sapply(dt[1, 3:5], \(x) replace(array(0, c(2, 2)), 3, x), simplify='array')
# , , p_1
#
# [,1] [,2]
# [1,] 0 0.07
# [2,] 0 0.00
#
# , , p_2
#
# [,1] [,2]
# [1,] 0 0.08
# [2,] 0 0.00
#
# , , p_3
#
# [,1] [,2]
# [1,] 0 0.09
# [2,] 0 0.00
Data:
dt <- structure(list(q_1 = 100, q_2 = 200, p_1 = 0.07, p_2 = 0.08,
p_3 = 0.09), class = "data.frame", row.names = c(NA, -1L))

How to repeat in R

I am new to R. I have a vector H = (0.6, 0.045, 3) and I want to create a matrix A in which every row holds the values of this vector (0.6, 0.045, 3). I want to be able to choose the number of rows myself. Like this:
A (0.6,0.045,3,
0.6,0.045,3,
0.6,0.045,3,
0.6,0.045,3,
............)
You can specify the number of rows and columns in the matrix() function; byrow = TRUE fills the matrix row by row.
vec <- c(0.6,0.045,3)
nr <- 4
matrix(vec, nrow = nr, ncol = length(vec), byrow = TRUE)
# [,1] [,2] [,3]
#[1,] 0.6 0.045 3
#[2,] 0.6 0.045 3
#[3,] 0.6 0.045 3
#[4,] 0.6 0.045 3
Another option is to use replicate():
t(replicate(nr, vec))
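To see why byrow = TRUE matters here, the following sketch compares it with the default column-wise fill and with an equivalent construction based on rep(). The names A, B and A2 are introduced for illustration.

```r
vec <- c(0.6, 0.045, 3)
nr <- 4

# byrow = TRUE: vec is recycled along rows, so every row is a copy of vec.
A <- matrix(vec, nrow = nr, ncol = length(vec), byrow = TRUE)

# Default (byrow = FALSE): vec is recycled down columns, which mixes the values.
B <- matrix(vec, nrow = nr, ncol = length(vec))

# Equivalent construction: repeat each element nr times, then fill by column.
A2 <- matrix(rep(vec, each = nr), nrow = nr)
```

A and A2 hold the same values, whereas the first column of B cycles through 0.6, 0.045, 3, 0.6.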

calculate frequency or percentage matrix in R

If I have the following:
mm <- matrix(0, 4, 3)
mm<-apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
> mm
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 0
[3,] 0 0 0
[4,] 1 0 1
How do I output a matrix that gives, for each pair of columns, the proportion of rows where both values equal 1? For example, there are two rows out of 4 where column 1 and column 2 both equal 1 (= 0.5) and 1 row out of 4 where column 2 and column 3 both equal 1 (= 0.25), so in this case I'd need:
[,1] [,2] [,3]
[1,] 1 0.5 0.5
[2,] 0.5 1 0.25
[3,] 0.5 0.25 1
I am not interested in comparing the same columns, so by default the diagonal remains at 1.
I thought I may get somewhere with cor(mm) where there may be a way to output co-frequencies or co-percentages instead of correlation coefficients but this appears to not be the case. But the dimensions of the final output should be an N by N column matrix as cor() outputs:
> cor(mm)
[,1] [,2] [,3]
[1,] 1.0000000 0.5773503 0.5773503
[2,] 0.5773503 1.0000000 0.0000000
[3,] 0.5773503 0.0000000 1.0000000
but obviously these are correlation coefficients; I just want co-frequencies or co-percentages instead.
A base R solution uses crossprod():
r <- `diag<-`(crossprod(mm)/nrow(mm),1)
such that
> r
[,1] [,2] [,3]
[1,] 1.0 0.50 0.50
[2,] 0.5 1.00 0.25
[3,] 0.5 0.25 1.00
DATA
mm <- structure(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1), .Dim = 4:3)
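To make the crossprod() step explicit, here is a short sketch of the same computation split into stages. It relies on mm containing only 0s and 1s; the intermediate name counts is introduced here for illustration.

```r
mm <- structure(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1), .Dim = 4:3)

# crossprod(mm) is t(mm) %*% mm: entry [i, j] is sum(mm[, i] * mm[, j]),
# i.e. the number of rows where columns i and j are both 1.
counts <- crossprod(mm)

# Divide by the number of rows to get proportions, then set the diagonal to 1.
r <- counts / nrow(mm)
diag(r) <- 1
```

Before the diagonal is overwritten, it holds the proportion of 1s in each column, which may be useful in its own right.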
set.seed(123)
mm <- matrix(0, 4, 3)
mm<-apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))
combinations <- expand.grid(1:ncol(mm), 1:ncol(mm))
matrix(unlist(Map(function(x, y) {
if (x == y) {
res <- 1
} else {
res <- sum(mm[, x] * mm[, y]) / nrow(mm)
}
res
}, combinations[, 1], combinations[, 2])), 3)
# [,1] [,2] [,3]
# [1,] 1.00 0.25 0.0
# [2,] 0.25 1.00 0.5
# [3,] 0.00 0.50 1.0

Convert ranks assigned by rank function into values

Sorry for the basic question.
My sample data looks like this:
DF
a b c
0.01 0.02 0.03
0.08 0.09 0.10
I use rank() to assign ranks to the values in DF:
s <- sapply(DF, rank, ties.method ="average")
How can I then assign values to the ranks? Or perhaps I am misunderstanding something.
Thank you for any suggestions.
I'm not sure I understand the question. Assuming you want ranks within each column (variable):
> set.seed(100)
> df<-matrix(rnorm(6),ncol=3)
> df
[,1] [,2] [,3]
[1,] -0.5021924 -0.07891709 0.1169713
[2,] 0.1315312 0.88678481 0.3186301
> s <- apply(df,2, rank, ties.method ="average")
> s
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
Or, if you want ranks across all the data:
> s <- matrix(rank(c(df), ties.method ="average"), ncol=3)
> s
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 6 5

How do I find peak values/row numbers?

I have a large dataset (202k points). I know that there are 8 values over 0.5. I want to subset on those rows.
How do I find/return a list of the row numbers where the values are > 0.5?
If the dataset is a vector named x:
(1:length(x))[x > 0.5]
If the dataset is a data.frame or matrix named x and the variable of interest is in column j:
(1:nrow(x))[x[,j] > 0.5]
But if you just want to find the subset and don't really need the row numbers, use
subset(x, x > 0.5)
for a vector and
subset(x, x[,j] > 0.5)
for a matrix or data.frame.
which(x > 0.5)
Here's some dummy data:
D<-matrix(c(0.6,0.1,0.1,0.2,0.1,0.1,0.23,0.1,0.8,0.2,0.2,0.2),nrow=3)
Which looks like:
> D
[,1] [,2] [,3] [,4]
[1,] 0.6 0.2 0.23 0.2
[2,] 0.1 0.1 0.10 0.2
[3,] 0.1 0.1 0.80 0.2
And here's the logical row index,
index <- (rowSums(D>0.5))>=1
You can use it to extract the rows you want:
PeakRows <- D[index,]
Which looks like this:
> PeakRows
[,1] [,2] [,3] [,4]
[1,] 0.6 0.2 0.23 0.2
[2,] 0.1 0.1 0.80 0.2
Using the argument arr.ind=TRUE with which() is a great way to find the row (or column) numbers where a condition is TRUE:
df <- matrix(c(0.6,0.2,0.1,0.25,0.11,0.13,0.23,0.18,0.21,0.29,0.23,0.51), nrow=4)
# [,1] [,2] [,3]
# [1,] 0.60 0.11 0.21
# [2,] 0.20 0.13 0.29
# [3,] 0.10 0.23 0.23
# [4,] 0.25 0.18 0.51
which with arr.ind=TRUE returns the array indices where the condition is TRUE
which(df > 0.5, arr.ind=TRUE)
row col
[1,] 1 1
[2,] 4 3
so the subset that excludes those rows (note the negative index) becomes
df[-which(df > 0.5, arr.ind=TRUE)[, "row"], ]
# [,1] [,2] [,3]
# [1,] 0.2 0.13 0.29
# [2,] 0.1 0.23 0.23
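If the goal is to keep the rows that contain a value above 0.5, or to split the data both ways, a logical row index is a safer sketch than negative indices. The names has_peak, peak_rows, other_rows and none are assumptions introduced here.

```r
df <- matrix(c(0.6, 0.2, 0.1, 0.25, 0.11, 0.13, 0.23, 0.18, 0.21, 0.29, 0.23, 0.51), nrow = 4)

# One logical flag per row: TRUE if any value in that row exceeds 0.5.
has_peak <- rowSums(df > 0.5) > 0

peak_rows  <- df[has_peak, , drop = FALSE]   # rows 1 and 4
other_rows <- df[!has_peak, , drop = FALSE]  # rows 2 and 3

# With no matches, the negative-index form selects nothing rather than everything.
none <- which(df > 2, arr.ind = TRUE)[, "row"]  # integer(0)
nrow(df[-none, , drop = FALSE])                 # 0, not 4
```

The logical index also avoids duplicated rows when a single row has more than one value above the threshold.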
