Divide each cell of large matrix by sum of its row - r

I have a site by species matrix. The dimensions are 375 x 360. Each value represents the frequency of a species in samples of that site.
I am trying to convert this matrix from frequencies to relative abundances at each site.
I've tried a few ways to achieve this and the only one that has worked is using a for loop. However, this takes an incredibly long time or simply never finishes.
Is there a function or a vectorised method of achieving this? I've included my for-loop as an example of what I am trying to do.
relative_abundance <- matrix(0, nrow= nrow(data),
ncol=ncol(data), dimnames = dimnames(data))
i=0
j=0
for(i in 1:nrow(relative_abundance)){
for(j in 1:ncol(relative_abundance)){
species_freq <- data[i,j]
row_sum <- sum(data[i,])
relative_abundance[i,j] <- species_freq/row_sum
}
}

You could do this using apply, but scale makes things even simpler in this case. Assuming you want to divide columns by their sums:
set.seed(0)
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
freqs <- scale(relative_abundance, center = FALSE,
scale = colSums(relative_abundance))
The matrix is too big to output here, but here's what it should look like:
> head(freqs[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.004409603 0.0014231499 0.003439803 0.004052685 0.0024026910
[2,] 0.001469868 0.0023719165 0.002457002 0.005065856 0.0004805382
[3,] 0.001959824 0.0018975332 0.004914005 0.001519757 0.0043248438
[4,] 0.002939735 0.0042694497 0.002948403 0.002532928 0.0009610764
[5,] 0.004899559 0.0009487666 0.000982801 0.001519757 0.0028832292
[6,] 0.001469868 0.0023719165 0.002457002 0.002026342 0.0009610764
And a sanity check:
> head(colSums(freqs))
[1] 1 1 1 1 1 1
Using apply:
freqs2 <- apply(relative_abundance, 2, function(i) i/sum(i))
This has the advantage of being easily changed to run by rows, but the results will be joined as columns anyway, so you'd have to transpose it.
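For the row-wise case the question asks about, the transpose step might look like the sketch below. The small matrix `m` is made up for illustration and is not the asker's data.

```r
# Toy 3 x 4 frequency matrix (illustrative values only)
m <- matrix(c(1, 2, 3, 4,
              5, 5, 5, 5,
              0, 2, 0, 2), nrow = 3, byrow = TRUE)

# apply() over rows returns each row's result as a *column*,
# so transpose to get back the original site-by-species orientation
row_freqs <- t(apply(m, 1, function(r) r / sum(r)))

# Sanity check: every row should now sum to 1
rowSums(row_freqs)
```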

Firstly, you could just do
relative_abundance[i,j] <- data[i,j]/sum(data[i,])
so you don't create the intermediate variables...
But to vectorise it, I suggest computing the row sums with the rowSums function (which is fast) and then using apply by columns, dividing each column by the row sums:
relative_freq <- apply(data, 2, function(x) x / rowSums(data))

Using some simple linear algebra we can produce faster results. Simply multiply on the left by a diagonal matrix with the scaling factors you need, like this:
library(Matrix)
set.seed(0)
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
Diagonal_Matrix <- diag(1/rowSums(relative_abundance))
And then we multiply from the left:
row_normalized_matrix <- Diagonal_Matrix %*% relative_abundance
If you want to normalize columnwise simply make:
Diagonal_Matrix <- diag(1/colSums(relative_abundance))
and multiply from the right.

You can do something like this
relative_abundance <- matrix(sample(1:10, 360*375, TRUE), nrow= 375)
datnorm <- relative_abundance/rowSums(relative_abundance)
this will be faster if relative_abundance is a matrix rather than a data.frame
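Why the plain division works: a matrix is stored column by column and `rowSums()` returns one value per row, so the vector is recycled down each column and element `[i, j]` is divided by the sum of row `i`. A minimal sketch comparing it with two base R alternatives that name the margin explicitly (the variable names are only for illustration):

```r
set.seed(0)
m <- matrix(sample(1:10, 360 * 375, TRUE), nrow = 375)

# Recycling: rowSums(m) has length nrow(m), so it lines up with
# the rows of every column in turn
by_recycling <- m / rowSums(m)

# The same result with the margin stated explicitly
by_sweep <- sweep(m, 1, rowSums(m), "/")
by_prop  <- prop.table(m, 1)

all.equal(by_recycling, by_sweep)
all.equal(by_recycling, by_prop)
```

Note that the same trick does not work for columns: `m / colSums(m)` would recycle in the wrong direction, so use `sweep(m, 2, colSums(m), "/")` or `prop.table(m, 2)` there.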

Related

'apply' on two different data frames at once in R

I am wondering whether there is a way to use apply-family function to evaluate two different dataframes at once? Or is there a better way to solve this problem? I can only think of a loop, and that is too slow:
# example data
df_model <- data.frame(DY = c(93,100,107), CC=rnorm(1:3, mean = 0.1))
df_data <- data.frame(DY = rep(c(93,100,107),each = 3), CC = c(rnorm(1:3),rnorm(1:3),rnorm(1:3)))
In this example, I would like to have a vector of three elements as output, processed as follows (here for the first case):
#example procedure case 1
collect <- matrix(0,ncol=3,nrow=3)
collect[1,] <- dnorm( df_data[which(df_data$DY == df_model$DY[1]),]$CC, df_model[1,]$CC, log=TRUE )
As input, I envisage
a list/vector of CCs in df_data, subsetted by the corresponding day DY (0.07624536 1.32623789 0.92921693)
evaluated against one value (0.00049671) of df_model, on that corresponding day DY
In the end I would like to collect the vectors in a matrix (collect in the example) with one row per value of df_model$DY and three columns, containing the evaluation of df_data against df_model on day DY.
[,1] [,2] [,3]
[1,] -0.9218075 -1.7977334 -1.3501992
[2,] -0.9356356 -0.9850012 -1.1753341
[3,] -1.2152926 -0.9195071 -2.4127840
This needs to be done as efficiently as possible.
I can do it in a loop (above you see the first case for the loop), but I am sure there are better ways.
I looked into the apply function family, but I get confused, as I have two different dataframes which I evaluate. Any help/pointers would be much appreciated!
We can use mapply or Map
mapply(function(x, y) dnorm(df_data$CC[df_data$DY == x], y,
log = TRUE), df_model$DY, df_model$CC)
-output
# [,1] [,2] [,3]
#[1,] -1.5031401 -2.7449464 -1.734319
#[2,] -0.9237629 -0.9243094 -1.115875
#[3,] -4.9848319 -1.1494313 -1.187122
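One detail worth checking: `mapply` simplifies its result so that each day ends up as a column, whereas the `collect` matrix in the question has one row per day. A sketch of the transposed version follows, with example data regenerated under an arbitrary seed (so the numbers will not match those printed above):

```r
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
df_model <- data.frame(DY = c(93, 100, 107), CC = rnorm(3, mean = 0.1))
df_data  <- data.frame(DY = rep(c(93, 100, 107), each = 3), CC = rnorm(9))

# For each (day, model mean) pair, evaluate that day's observations
# against the model value; t() puts one day per row, as in 'collect'
res <- t(mapply(
  function(day, mu) dnorm(df_data$CC[df_data$DY == day], mean = mu, log = TRUE),
  df_model$DY, df_model$CC
))
res
```

This relies on every day having the same number of observations; otherwise `mapply` returns a list rather than a matrix.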

How to apply a function to array margin and create pairwise combination matrix

I am using R to apply a self-written function, which takes two numeric vectors plus a numeric parameter as input, over the columns of a data frame. Each column in the data frame is a numeric vector, and I want to perform pairwise computations and create a matrix containing the result of the computation for every possible combination of columns. So essentially I want behaviour similar to that of the cor() function.
# Data
> head(d)
1 2 3 4
1 -1.01035342 1.2490665 0.7202516 0.101467379
2 -0.50700743 1.4356733 0.9032172 -0.001583743
3 -0.09055243 0.4695046 2.4487632 -1.082570048
4 1.11230416 0.2885735 0.3534247 -0.728574628
5 -1.96115691 0.4831158 1.5650052 0.648675605
6 1.20434218 1.7668086 0.2170858 -0.161570792
> cor(d)
1 2 3 4
1 1.00000000 0.08320968 -0.06432155 0.04909430
2 0.08320968 1.00000000 -0.04557743 -0.01092765
3 -0.06432155 -0.04557743 1.00000000 -0.01654762
4 0.04909430 -0.01092765 -0.01654762 1.00000000
I found this useful answer: Perform pairwise comparison of matrix
Based on this I wrote this function which makes use of another self-written function compareFunctions()
createProbOfNonEqMatrix <- function(df,threshold){
combinations <- combn(ncol(df),2)
predDF <- matrix(nrow = length(density(df[,1])$y)) # df creation for predicted values from density function
for(i in 1:ncol(df)){
predCol <- density(df[,i])$y # convert df of original values to df of predicted values from density function
predDF <- cbind(predDF,predCol)
}
predDF <- predDF[,2:ncol(predDF)]
colnames(predDF) <- colnames(df) # give the predicted values column names as in the original df
predDF <- as.matrix(predDF)
out.mx <- apply( X=combinations,MARGIN = 2,FUN = "compareFunctions",
predicted_by_first = predDF[,combinations[1]],
predicted_by_second = predDF[,combinations[2]],
threshold = threshold)
return(out.mx)
}
The predicted_by_first, predicted_by_second and threshold are inputs for compareFunctions. However I get the following error:
Error in FUN(newX[, i], ...) : unused argument (newX[, i])
In desperation I tried this:
createProbOfNonEqMatrix <- function(df,threshold){
combinations <- combn(ncol(df),2)
predDF <- matrix(nrow = length(density(df[,1])$y))
for(i in 1:ncol(df)){
predCol <- density(df[,i])$y
predDF <- cbind(predDF,predCol)
}
predDF <- predDF[,2:ncol(predDF)]
colnames(predDF) <- colnames(df)
predDF <- as.matrix(predDF)
out.mx <- apply(
X=combinations,MARGIN = 2,FUN = function(x) {
diff <- abs(predDF[,x[1]]-predDF[,x[2]])
boolean <- diff<threshold
acceptCount <- length(boolean[boolean==TRUE])
probability <- acceptCount/length(diff)
return(probability)
}
)
return(out.mx)
}
It does seem to be working but instead of returning the pairwise matrix it gives me a vector:
> createProbOfNonEqMatrix(d,0.001)
[1] 0.10351562 0.08203125 0.13476562 0.13085938 0.14843750 0.10937500
Will you be able to guide me on how to make the desired pairwise matrix even if it implies writing the function code again within apply()? Also, if you could give me an idea on how to keep track of what pairwise comparisons are performed it will be greatly appreciated.
Thank you,
Alex
Your output gives you the result of the calculation in the order of the pairs in combinations: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). If you want to organise this into a symmetric square matrix you can do a basic manipulation on the result, e.g. as follows:
out.mx<-c(0.10351562, 0.08203125, 0.13476562, 0.13085938, 0.14843750, 0.10937500)
combinations<-combn(ncol(d),2) # same pairs, in the same order, as inside your function
out.mtx<-matrix(nrow=ncol(d),ncol=ncol(d))
out.mtx[,]<-1
for (i in 1:length(combinations[1,])){
a<-combinations[1,i]
b<-combinations[2,i]
out.mtx[a,b]<-out.mtx[b,a]<-out.mx[i]
}
out.mtx
which gives you
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.1035156 0.08203125 0.1347656
[2,] 0.10351562 1.0000000 0.13085938 0.1484375
[3,] 0.08203125 0.1308594 1.00000000 0.1093750
[4,] 0.13476562 0.1484375 0.10937500 1.0000000
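The same fill can be done without a loop by indexing the matrix with a two-column matrix of (row, column) pairs. This is only a sketch; the names `vals`, `pairs` and `out` are chosen for illustration, and the values are the ones printed in the question.

```r
vals <- c(0.10351562, 0.08203125, 0.13476562,
          0.13085938, 0.14843750, 0.10937500)
k <- 4                      # number of columns in the original data

# One row per comparison: pairs[i, ] says which two columns produced vals[i]
pairs <- t(combn(k, 2))

out <- diag(1, k)           # 1 on the diagonal, as in the loop version
out[pairs] <- vals          # upper triangle
out[pairs[, 2:1]] <- vals   # mirrored into the lower triangle
out
```

Printing `cbind(pairs, vals)` is a simple way to keep track of which pairwise comparison each value belongs to.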

Computing the correlation between the auto-correlation and cross-correlation for each pair of rows in a matrix

I have a matrix with dimensions m by n. For example:
m = 4
n = 10
mat = matrix(rnorm(m*n), nrow = m, ncol=n)
For a certain pair of rows i, j:
i=1
j=2
I compute the correlation between the auto-correlation of row i and the cross-correlation of rows i and j. So given:
lag=5
The auto-correlation of row i would be:
acf.i = acf(mat[i,],lag.max=lag)
the cross-correlation of rows i and j would be:
ccf.i.j = ccf(mat[i,],mat[j,],lag.max=lag)
and the correlation between acf.i and ccf.i.j would be something like:
cor.acf.i.ccf.i.j = cor(acf.i$acf,ccf.i.j$acf[(lag+1):(2*lag+1)])
(since ccf computes the correlation with lag range of: -lag:lag and acf only in the range of 0:lag I arbitrarily choose to take the range 0:lag for ccf.i.j)
What I want is to efficiently do that for each row i and each other row in mat, over all rows of mat. I guess this function should return a matrix with dimensions m by m.
Make sure you set plot to FALSE for acf, ccf. Then, you can just wrap your code in a call to outer to provide every pair of i and j values. Note that since outer expects a vectorized FUN (e.g. *), we need to vectorize your function:
set.seed(1)
m <- 4
n <- 10
mat <- matrix(rnorm(m*n), nrow = m, ncol=n)
lag <- 5
outer(1:nrow(mat), 1:nrow(mat),
Vectorize(
function(i, j) {
acf.i <- acf(mat[i,],lag.max=lag, plot=F)
ccf.i.j <- ccf(mat[i,],mat[j,],lag.max=lag, plot=F)
cor(acf.i$acf,ccf.i.j$acf[(lag+1):(2*lag+1)])
} ) )
# [,1] [,2] [,3] [,4]
# [1,] 1.0000000 0.47035200 -0.006371955 -0.85880247
# [2,] 0.4133899 1.00000000 -0.462744858 -0.13327111
# [3,] -0.3573965 0.01882691 1.000000000 0.09358042
# [4,] -0.8570117 -0.58359258 0.249930947 1.00000000
This is relatively efficient. There may be a better algorithm than the one you use to get the same answer, but I'm not familiar enough with this stuff to provide it.

Efficient way to calculate array multiplication

Is there any efficient way to calculate 2x2 matrix H without for statement?
n=10
a=array(rnorm(n),c(2,1,n))
b=array(rnorm(n),c(2,1,n))
H=matrix(0,2,2)
for(i in 1:n) H=H+a[,,i] %*% t(b[,,i])
H=matrix(0,2,2)
for(i in 1:n) H=H+a[,,i] %*% t(b[,,i])
H
#----------
[,1] [,2]
[1,] 10.770929 -0.4245556
[2,] -5.613436 -1.7588095
H2 <-a[ ,1, ] %*% t(b[ ,1, ])
H2
#-------------
[,1] [,2]
[1,] 10.770929 -0.4245556
[2,] -5.613436 -1.7588095
This does depend on the arrays in question having one of their dimensions == 1, and on the fact that "[" will drop length-1 dimensions unless you specify drop=FALSE.
This is the same (up to FAQ 7.31 floating-point issues) as what you calculate.
In case the second dimension truly has only 1 level, you can use
tcrossprod( matrix(a,nr=2), matrix(b,nr=2) )
and more generally,
crossprod( matrix( aperm(a, c(3,1,2)), nc=2), matrix( aperm(b, c(3,1,2)), nc=2) )
If you can create 'a' and 'b' ordered so that you do not need the aperm(), it will be faster still.
The relative speed of different solutions depends on the dimensions. If the first two are both big and the last one small, a loop like yours (but using crossprod) might be as quick as you can get.
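A quick way to convince yourself that the loop-free versions agree with the loop, for the case where the second dimension has length 1. This sketch fills the arrays with `2 * n` random numbers (the question's `rnorm(n)` gets recycled) and uses an arbitrary seed:

```r
set.seed(1)
n <- 10
a <- array(rnorm(2 * n), c(2, 1, n))
b <- array(rnorm(2 * n), c(2, 1, n))

# Reference: sum of outer products, one slice at a time
H <- matrix(0, 2, 2)
for (i in seq_len(n)) H <- H + a[, , i] %*% t(b[, , i])

# a[, 1, ] drops the length-1 dimension and leaves a 2 x n matrix
H2 <- a[, 1, ] %*% t(b[, 1, ])

# tcrossprod(x, y) computes x %*% t(y) without forming the transpose
H3 <- tcrossprod(matrix(a, nrow = 2), matrix(b, nrow = 2))

all.equal(H, H2)
all.equal(H, H3)
```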

How to write linearly dependent column in a matrix in terms of linearly independent columns?

I have a large mxn matrix, and I have identified the linearly dependent columns. However, I want to know if there's a way in R to write the linearly dependent columns in terms of the linearly independent ones. Since it's a large matrix, it's not possible to do based on inspection.
Here's a toy example of the type of matrix I have.
> mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 1 0
[2,] 1 1 0 0 1
[3,] 1 0 1 1 0
[4,] 1 0 1 0 1
Here it's obvious that x3 = x1-x2, x5=x1-x4. I want to know if there's an automated way to get that for a larger matrix.
Thanks!
I'm sure there is a better way but I felt like playing around with this. I basically do a check at the beginning to see if the input matrix is of full column rank, to avoid unnecessary computation in that case. After that I start with the first two columns and check if that submatrix is of full column rank; if it is, then I check the first three columns, and so on. Once we find a submatrix that isn't of full column rank, I regress the last column in that submatrix on the previous independent ones, which tells us how to construct a linear combination of the earlier columns to get the last column.
My function isn't very clean right now and could do some additional checking but at least it's a start.
mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
linfinder <- function(mat){
# If the matrix is full rank then we're done
if(qr(mat)$rank == ncol(mat)){
print("Matrix is of full rank")
return(invisible(seq(ncol(mat))))
}
m <- ncol(mat)
# cols keeps track of which columns are linearly independent
cols <- 1
for(i in seq(2, m)){
ids <- c(cols, i)
mymat <- mat[, ids]
if(qr(mymat)$rank != length(ids)){
# Regress the column of interest on the previous
# columns to figure out the relationship
o <- lm(mat[,i] ~ mat[,cols] + 0)
# Construct the output message
start <- paste0("Column_", i, " = ")
# Which coefs are nonzero
nz <- !(abs(coef(o)) <= .Machine$double.eps^0.5)
tmp <- paste("Column", cols[nz], sep = "_")
vals <- paste(coef(o)[nz], tmp, sep = "*", collapse = " + ")
message <- paste0(start, vals)
print(message)
}else{
# If the matrix subset was of full rank
# then the newest column is linearly independent
# so add it to the cols list
cols <- ids
}
}
return(invisible(cols))
}
linfinder(mat)
which gives
> linfinder(mat)
[1] "Column_3 = 1*Column_1 + -1*Column_2"
[1] "Column_5 = 1*Column_1 + -1*Column_4"
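As a possible shortcut, a single pivoted QR decomposition can pick out the independent columns, after which all dependent columns can be solved for at once. This is a sketch of a different technique from the function above, not a replacement for it; it assumes base R's default `qr()`, which moves columns it finds to be dependent to the end of the pivot order, and the names `indep`, `dep` and `coefs` are made up here.

```r
mat <- matrix(c(1,1,0,1,0,
                1,1,0,0,1,
                1,0,1,1,0,
                1,0,1,0,1), byrow = TRUE, ncol = 5)

q <- qr(mat)

# The first 'rank' entries of the pivot are the columns kept as independent
indep <- sort(q$pivot[seq_len(q$rank)])
dep   <- setdiff(seq_len(ncol(mat)), indep)

# Solve mat[, indep] %*% coefs = mat[, dep]; each column of 'coefs'
# holds the weights that rebuild one dependent column
coefs <- qr.solve(mat[, indep], mat[, dep])
dimnames(coefs) <- list(paste0("Column_", indep), paste0("Column_", dep))
round(coefs, 10)
```

Each column of `coefs` reads like one line of the `linfinder()` output. For matrices that are only numerically close to rank deficient, the `tol` argument of `qr()` decides what counts as dependent.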

Resources