R - change matrix values based on another matrix's indices

I have two matrices:
m1 <- matrix(runif(750), nrow = 50, byrow=T)
m2 <- matrix(rep(TRUE,750), nrow = 50, byrow=T)
For each row of m1, I need to find the indices of the two lowest values. Then I need to use the remaining indices (i.e. those not corresponding to the two lowest values) to assign FALSE in m2.
It is fairly easy to do for one row:
ind <- order(m1[1,], decreasing=FALSE)[1:2]
m2[1,][-ind] <- FALSE
Therefore, I can use a loop to do the same for all rows:
for (i in 1:dim(m1)[1]) {
  ind <- order(m1[i, ], decreasing = FALSE)[1:2]
  m2[i, ][-ind] <- FALSE
}
However, on my real data this loop runs slower than I would like, since my matrices are quite large (500000x150000).
Is there a faster, R-native way to achieve the same result without using loops?

You can try the code below
m2 <- t(apply(m1,1,function(x) x %in% head(sort(x),2)))
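As a quick sanity check (a sketch of my own, assuming the 50 x 15 example data above and no tied values within a row), this reproduces the loop's result:
m2_loop <- matrix(TRUE, nrow = nrow(m1), ncol = ncol(m1))
for (i in seq_len(nrow(m1))) {
  ind <- order(m1[i, ], decreasing = FALSE)[1:2]
  m2_loop[i, -ind] <- FALSE
}
# should be TRUE as long as no row has ties among its two lowest values
identical(m2_loop, t(apply(m1, 1, function(x) x %in% head(sort(x), 2))))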

You can try apply since you have a matrix:
val <- rep(TRUE, ncol(m1))
m3 <- t(apply(m1, 1, function(x) {val[-order(x)[1:2]] <- FALSE;val}))

You can do:
m2 <- t(apply(m1, 1, function(x) rank(x)<3))
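Note that rank() averages ties by default, so a row with tied values could end up with more or fewer than two TRUE entries; if that matters, a variant (my own addition, not from the answer) is:
m2 <- t(apply(m1, 1, function(x) rank(x, ties.method = "first") < 3))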

Using pmap (min_rank() comes from dplyr, not purrr, so it needs to be loaded as well)
library(purrr)
library(dplyr)
pmap_dfr(as.data.frame(m1), ~ min_rank(c(...)) < 3)
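For matrices as large as the ones mentioned in the question, apply() still loops over rows in R. If the matrixStats package is available (an assumption; it is not mentioned in the question), the ranking can stay fully vectorized, e.g.:
library(matrixStats)
# rowRanks() ranks values within each row; <= 2 keeps TRUE only at the two lowest
# (assumes no ties within a row; see ?rowRanks for the available ties.method options)
m2 <- rowRanks(m1) <= 2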

Related

How to apply row-wise data frame operation in R

I have the data frame blood2 (49x3) to which I wish to apply the following row-operation.
cblood2[1,] <- ((blood2[1,]*mcabc)*100)/(sum(blood2[1,]*mcabc))
where, mcabc is a 1x3 row-vector.
I want to do the above operation on every row in blood2 (49 row-wise operations) and store the results in cblood2. But I do not want to write out each of the row-operations in my R code. Is there a way to do this with tidyverse or similar R packages?
I tried the following code but the results are different from what I expect.
cblood2[1:49,] <- ((blood2[1:49,]*mcabc)*100)/(sum(blood2[1:49,]*mcabc))
Any advice is greatly appreciated. Thank you!
Use apply with MARGIN = 1 to loop over the rows and apply the function:
out1 <- t(apply(blood2, 1, function(x) x * mcabc * 100/(sum(x * mcabc))))
Or another option with vectorized functions (rowSums)
mcabc2 <- c(mcabc)[col(blood2)]
out2 <- ((blood2 * mcabc2) * 100)/(rowSums(blood2 *mcabc2))
Checking the outputs:
> identical(out1, out2)
[1] TRUE
data
set.seed(24)
blood2 <- matrix(rnorm(49 * 3), nrow = 49, ncol = 3)
mcabc <- matrix(1:3, ncol = 3)
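An equivalent formulation with sweep() (a sketch of my own, assuming blood2 is a numeric matrix as constructed above):
tmp  <- sweep(blood2, 2, c(mcabc), `*`)   # multiply each column by the matching mcabc entry
out3 <- tmp * 100 / rowSums(tmp)          # scale each row so it sums to 100
identical(out2, out3)                     # expected TRUE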

Two-way frequency table followed by matrix multiplication - high running time

I'm new to R. I'm trying to calculate the product of a fixed matrix with a 2-way frequency table, for every combination of columns in a data frame or matrix, and divide it by the sequence length (i.e. the number of rows, here 15). The problem is that the running time increases dramatically when I run it on 1K sequences (1K columns): it already takes more than 10 minutes, and for 10K it could be more than an hour. The goal is to use it with as many sequences as possible.
mat1 <- matrix(sample(LETTERS),ncol = 100,nrow = 15)
mat2 <- matrix(sample(abs(rnorm(26,0,3))),ncol=26,nrow=26)
rownames(mat2) <- LETTERS
colnames(mat2) <- LETTERS
diag(mat2) <- 0
test_vec <- c()
for (i in seq(ncol(mat1) - 1)) {
  for (j in seq(i + 1, ncol(mat1))) {
    s2 <- table(mat1[, i], mat1[, j]) # create 2-way frequency table
    mat2_1 <- mat2
    mat2_1 <- mat2_1[rownames(mat2_1) %in% rownames(s2),
                     colnames(mat2_1) %in% colnames(s2)]
    calc <- (1 / nrow(mat1)) * sum(mat2_1 * s2)
    test_vec <- append(test_vec, calc)
  }
}
Thanks for the help.
Here is an approach that converts mat1 to a data.table, converts all the columns to factors, and uses table(..., exclude = NULL):
library(data.table)
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = sapply(g, function(x) sum(table(m[[x[1]]], m[[x[2]]], exclude=NULL)*mat2)/nrow(m))
Check equality:
sum(result-test_vec>1e-10)
[1] 0
Here there are 4950 combinations (100*99/2), but the number of combinations will increase quickly as ncol(mat1) (the number of sequences) increases, as you point out. You might find in that case that a parallelized version works well.
library(doParallel)
library(data.table)
registerDoParallel()
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = foreach(i=1:length(g), .combine=c) %dopar%
sum(table(m[[g[[i]][1]]], m[[g[[i]][2]]], exclude=NULL)*mat2)
result = result/nrow(m)
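Since every entry of mat1 is one of LETTERS (as in the example data), the 2-way table can be skipped entirely: sum(mat2_1 * s2) is just the sum of mat2 looked up at each observed letter pair. A sketch of that idea (my own, not part of the original answer):
# sum(mat2 * table(a, b)) equals sum(mat2[cbind(a, b)]) when a and b only contain LETTERS
pair_score <- function(i, j) sum(mat2[cbind(mat1[, i], mat1[, j])]) / nrow(mat1)
idx <- combn(ncol(mat1), 2)
test_vec2 <- mapply(pair_score, idx[1, ], idx[2, ])
all.equal(test_vec, test_vec2)   # expected TRUE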

apply list of indices to list of dataframes

I need to apply a list of indices to a list of dataframes with a one on one mapping. First element of the list of indices goes to the first dataframe only and so on. List of indices applies to the rows in the dataframes.
A list of complementary data frames also needs to be created, by selecting the rows not mentioned in the indices list.
Here is some sample data:
set.seed(1)
A <- data.frame(matrix(rnorm(40,0,1), nrow = 10))
B <- data.frame(matrix(rnorm(40,2,3), nrow = 10))
C <- data.frame(matrix(rnorm(40,3,4), nrow = 10))
dflis <- list(A,B,C)
# Create a sample row index
ix <- lapply(lapply(dflis,nrow), sample, size = 6)
So far I have managed this working but ugly looking code:
dflis.train <- lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
dflis.test <- lapply(seq_along(dflis), function(x) dflis[[x]][-ix[[x]],])
Can someone suggest something better, more elegant?
Use Map/mapply instead of the univariate lapply, so that you can iterate over both objects and apply a function, like:
Map(function(d,r) d[r,], dflis, ix)
Or if you want to be fancy:
Map(`[`, dflis, ix, TRUE)
This matches your requested answer:
identical(
Map(function(d,r) d[r,], dflis, ix),
lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
)
#[1] TRUE
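The complementary ("test") data frames mentioned in the question can be built the same way by negating the indices:
dflis.test2 <- Map(function(d, r) d[-r, ], dflis, ix)
identical(dflis.test2, lapply(seq_along(dflis), function(x) dflis[[x]][-ix[[x]], ]))  # should also be TRUE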

How to substitute negative values with a calculated value in an entire dataframe

I've got a huge data frame with many negative values in different columns that should be replaced by their original value * 0.5.
I've tried many R functions, but it seems I can't find a single one that works on the entire data frame.
I would like something like the following (not working) piece of code:
mydf[] <- replace(mydf[], mydf[] < 0, mydf[]*0.5)
You can simply do,
mydf[mydf<0] <- mydf[mydf<0] * 0.5
If some of your columns are non-numeric, you may want to apply this only to the numeric ones:
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1<0] <- mydf1[mydf1<0] * 0.5
mydf[ind] <- mydf1
You could use lapply() on the entire data frame, making the replacement in each column in turn:
df[] <- lapply(df, function(x) ifelse(x < 0, x * 0.5, x))
The lapply(), or list apply, function is intended to be used on lists, but data frames are a special type of list, so this works here. Assigning to df[] (rather than df) keeps the result a data frame instead of turning it into a plain list.
In replace(), the values argument should have the same length as the number of TRUE values in the 'index' vector:
replace(mydf, mydf <0, mydf[mydf <0]*0.5)
Or another option is set from data.table, which would be very efficient
library(data.table)
for (j in seq_along(mydf)) {
  i1 <- mydf[[j]] < 0
  set(mydf, i = which(i1), j = j, value = mydf[[j]][i1] * 0.5)
}
data
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))

Efficient algorithm to turn matrix subdiagonal to columns r

I have a non-square matrix and need to do some calculations on its subdiagonals. I figured out that the best way is to turn the subdiagonals into columns/rows and use functions like cumprod. Right now I use a for loop and exdiag defined as below:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
However, it does not seem very efficient. Do you know any other algorithm to achieve that kind of result?
A little example to show what I am doing:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
mat <- matrix(1:72, nrow = 12, ncol = 6)
newmat <- matrix(nrow=11, ncol=6)
for (i in 1:11) {
  newmat[i, ] <- c(cumprod(exdiag(mat, i)), rep(0, max(6 - 12 + i, 0)))
}
Best regards,
Artur
The fastest but by far the most cryptic solution to get all possible diagonals from a non-square matrix would be to treat your matrix as a vector and simply construct an id vector for selection. In the end you can transform it back to a matrix if you want.
The following function does that:
exdiag <- function(mat) {
  NR <- nrow(mat)
  NC <- ncol(mat)
  smalldim <- min(NC, NR)
  if (NC > NR) {
    id <- seq_len(NR) +
      seq.int(0, NR - 1) * NR +
      rep(seq.int(1, NC - 1), each = NR) * NR
  } else if (NC < NR) {
    id <- seq_len(NC) +
      seq.int(0, NC - 1) * NR +
      rep(seq.int(1, NR - 1), each = NC)
  } else {
    return(diag(mat))
  }
  out <- matrix(mat[id], nrow = smalldim)
  id <- (ncol(out) + 1 - row(out)) - col(out) < 0
  out[id] <- NA
  return(out)
}
Keep in mind that you have to take into account how your matrix is formed. In both cases I follow the same logic: first construct a sequence indicating positions along the smallest dimension, then add 0, 1, 2, ... times the row length to this sequence. This creates the first diagonal. After doing this, you simply add a sequence that shifts the entire previous sequence by 1 (either down or to the right) until you reach the end of the matrix. To shift right, I need to multiply this sequence by the number of rows.
In the end you can use these indices to select the correct positions from mat, and return all that as a matrix. Due to the vectorized nature of this code, you have to check that the last subdiagonals are correct. These contain fewer elements than the first ones, so you have to replace the values that are not part of that subdiagonal with NA. Also here you can simply use an indexing trick.
You can use it as follows:
> diag1 <- exdiag(amatrix)
> diag2 <- exdiag(t(amatrix))
> identical(diag1, diag2)
[1] TRUE
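To see concretely what the index trick produces, here is a tiny illustration of my own (not part of the original answer) on a 4 x 3 matrix; each column of the result is a subdiagonal (offsets 1, 2, 3), padded with NA:
small_mat <- matrix(1:12, nrow = 4)
exdiag(small_mat)
#      [,1] [,2] [,3]
# [1,]    2    3    4
# [2,]    7    8   NA
# [3,]   12   NA   NA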
To arrive at your result:
amatrix <- matrix(1:72, ncol = 6)
diag1 <- exdiag(amatrix)
res <- apply(diag1,2,cumprod)
res[is.na(res)] <- 0
t(res)
You can modify the diag() function.
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
exdiag2 <- function(matrix, off){diag(matrix[-1:-off,])}
Speed Test:
mat = diag(10, 10000,10000)
off = 4
> system.time(exdiag(mat,4))
user system elapsed
7.083 2.973 10.054
> system.time(exdiag2(mat,4))
user system elapsed
5.370 0.155 5.524
> system.time(diag(mat))
user system elapsed
0.002 0.000 0.002
It looks like the subsetting of the matrix takes most of the time, but this still performs better than your implementation. Maybe there are other subsetting approaches that outperform my solution. :)
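A quick sanity check of my own (not part of the answer) that both functions return the same subdiagonal for off >= 1:
small_mat <- matrix(1:12, nrow = 4)
identical(exdiag(small_mat, 2), exdiag2(small_mat, 2))   # should be TRUE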
