Matrix/array multiplication in R data frame - r

I am trying to do a matrix multiplication from a data frame in R.
The first array is one row of the data frame, i.e. one element from each column (1x6). It is multiplied by a correlation matrix (6x6) and then by the transpose of the first array (6x1) to give a single number. This has to be done for every row of the data frame.
Here is how I do it in Excel:
[image: snapshot of calculation in Excel]
A <- c(2,3,4,5,6)
B <- c(4,5,6,7,8)
C <- c(6,7,8,9,10)
D <- c(8,9,10,11,12)
E <- c(10,11,12,13,14)
F <- c(12,13,14,15,16)
df <- data.frame (A,B,C,D,E,F)
## 6x6 correlation matrix
corr <- matrix(
c(1,0,0,0,0,0,
0,1,0,0,0,0,
0,0,1,.6,.6,.5,
0,0,.6,1,.6,.7,
0,0,.6,.6,1,.6,
0,0,.5,.7,.6,1),
nrow = 6,ncol =6, byrow = TRUE)
What I need is to add another column to df, where the result for row 1 = [2,4,6,8,10,12] * corr * transpose[2,4,6,8,10,12]

The following one-liner solves it.
Note that in R vectors are column vectors so the transpose is on the left side of the multiplication.
apply(df, 1, \(x) t(x) %*% corr %*% x)
#> [1] 940.0 1167.2 1420.8 1700.8 2007.2
Created on 2023-02-18 with reprex v2.0.2
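For data frames with many rows, the same quadratic form can also be computed without a per-row apply. The following is a minimal sketch and not part of the answer above; the names X and quad are introduced here for illustration. It relies on the fact that row i of (X %*% corr) * X sums to x_i' corr x_i.

```r
# Same inputs as in the question
df <- data.frame(A = c(2, 3, 4, 5, 6),      B = c(4, 5, 6, 7, 8),
                 C = c(6, 7, 8, 9, 10),     D = c(8, 9, 10, 11, 12),
                 E = c(10, 11, 12, 13, 14), F = c(12, 13, 14, 15, 16))
corr <- matrix(
  c(1, 0, 0,  0,  0,  0,
    0, 1, 0,  0,  0,  0,
    0, 0, 1,  .6, .6, .5,
    0, 0, .6, 1,  .6, .7,
    0, 0, .6, .6, 1,  .6,
    0, 0, .5, .7, .6, 1),
  nrow = 6, ncol = 6, byrow = TRUE)

X <- as.matrix(df)                 # 5 x 6 numeric matrix, one row per observation
quad <- rowSums((X %*% corr) * X)  # element i is x_i' %*% corr %*% x_i
df$result <- quad                  # new column requested in the question
```

The values should agree with the apply() one-liner; the difference is that a single matrix product replaces one small product per row.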

Related

Apply concordance dataframe to zoo objects

I have a zoo object made of several time series, like this:
indices <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-30'),by="year")
a <- zoo(rnorm(5), order.by=indices)
b <- zoo(rnorm(5), order.by=indices)
c <- zoo(rnorm(5), order.by=indices)
ts_origin <- merge(a,b,c)
I would like to multiply each zoo series from ts_origin by a ratio contained in a dataframe, and put
the results in another zoo object (ts_final) that contains the time series d,e,f. In other words,
the dataframe is a concordance file between a,b,c and d,e,f, and the ratio would be applied this way:
ts_final$d = ts_origin$a * 10 ; ts_final$e = ts_origin$b * 100 ; ts_final$f = ts_origin$c * 1000.
df <- data.frame(original = c("a","b","c"),
final = c("d","e","f"),
ratio = c(10,100,1000))
indices <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-30'),by="year")
d <- zoo(, order.by=indices)
e <- zoo(, order.by=indices)
f <- zoo(, order.by=indices)
ts_final <- merge(d,e,f)
I'm not too sure what the best approach for this is. I was trying with the apply function, but couldn't make
it work... any help would be greatly appreciated!
1) Map/merge
Use Map to iterate over final, original and ratio, computing the required products and producing a list of zoo objects, L. Note that Map takes the names from the first argument after fun. Then merge the list components to form the zoo object ts_final.
fun <- function(f, o, r) ts_origin[, o] * r
L <- with(df, Map(fun, final, original, ratio))
ts_final <- do.call("merge", L)
The result using the inputs shown in the Note at the end is this zoo object:
> ts_final
d e f
2000-01-01 -5.6047565 46.09162 400.7715
2001-01-01 -2.3017749 -126.50612 110.6827
2002-01-01 15.5870831 -68.68529 -555.8411
2003-01-01 0.7050839 -44.56620 1786.9131
2004-01-01 1.2928774 122.40818 497.8505
2005-01-01 17.1506499 35.98138 -1966.6172
2) sweep
Another approach is to sweep out the ratios setting the names appropriately giving the same result as in (1).
with(df, sweep(setNames(ts_origin[, original], final), 2, ratio, "*"))
3) rep
Set the names and multiply by ratio repeated appropriately giving the same result as in (1).
nr <- nrow(ts_origin)  # each ratio must be repeated once per row of ts_origin
with(df, setNames(ts_origin[, original], final) * rep(ratio, each = nr))
Note
We can define the input reproducibly like this:
set.seed(123)
tt <- as.Date(ISOdate(2000:2005, 1, 1))
m <- matrix(rnorm(6*3), 6, dimnames = list(NULL, c("a", "b", "c")))
ts_origin <- zoo(m, tt)
df <- data.frame(original = c("a","b","c"),
final = c("d","e","f"),
ratio = c(10,100,1000))
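Approaches (2) and (3) both scale each column by its own ratio. A minimal sketch on a plain base-R matrix shows the equivalence; the matrix m is a stand-in for the zoo object (an assumption made here so that the example does not need the zoo package). Because R stores matrices column by column, rep must repeat each ratio once per row of the matrix being scaled.

```r
# Stand-in for ts_origin: 6 rows (dates) x 3 columns (series a, b, c)
m <- matrix(1:18, nrow = 6, dimnames = list(NULL, c("a", "b", "c")))
ratio <- c(10, 100, 1000)

by_sweep <- sweep(m, 2, ratio, "*")         # multiply column j by ratio[j]
by_rep   <- m * rep(ratio, each = nrow(m))  # same, via column-major recycling
by_t     <- t(ratio * t(m))                 # same, via double transpose
```

All three results should be equal; sweep states the intent most directly, while the other two depend on how R recycles vectors over matrices.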
Here is a one-liner; the result initially keeps the original column names.
ts_final <- t(df$ratio * t(ts_origin))
ts_final
# a b c
#2000-01-01 -5.382213 -12.64773 -513.6408
#2001-01-01 -9.218280 -98.55123 -1826.6430
#2002-01-01 2.114663 -28.58910 290.8008
#2003-01-01 -3.576460 -23.47314 -166.5473
#2004-01-01 6.490508 -36.29317 -398.0389
#2005-01-01 -5.382213 -12.64773 -513.6408
Now assign final names.
colnames(ts_final) <- df$final

Equivalent of rowsum function for Matrix-class (dgCMatrix)

For the base R matrix class we have the rowsum function, which is very fast for computing column sums across groups of rows.
Is there an equivalent function or approach implemented in the Matrix-package?
I'm particularly interested in a fast alternative to rowsum for large dgCMatrix-objects (i.e. millions of rows, but roughly 95% sparse).
I know this is an old question, but Matrix::rowSums might be the function you are looking for.
The DelayedArray Bioconductor package now has a rowsum function that accepts sparse matrices; it was very fast when I tried it.
Here is an approach using matrix multiplication, based on an example in https://slowkow.com/notes/sparse-matrix/. First, let's create a sparse matrix to play with,
library(magrittr)
library(forcats)
library(stringr)
library(Matrix)
set.seed(42)
m <- sparseMatrix(
i = sample(x = 1e4, size = 1e4),
j = sample(x = 1e4, size = 1e4),
x = rnorm(n = 1e4)
)
colnames(m) <- str_c("col", seq(ncol(m)))
rownames(m) <- str_c("row", seq(nrow(m)))
and a grouping vector defining which rows to sum,
group <- sample(1:10, nrow(m), replace = TRUE) %>%
paste0("new_row", .) %>%
fct_inorder
Whether group is a factor and its level order will affect the final row order in the merged matrix. I made group a factor with levels ordered by first appearance in group to make the row order resemble that from the rowsum() operation with reorder = FALSE.
Next, we create a (sparse) indicator matrix, with one row per group and one column per row of m, that we multiply with m from the left to get a version of m whose rows have been summed based on group,
group_mat <- sparse.model.matrix(~ 0 + group) %>% t
# Adjust row names to get the correct final row names
rownames(group_mat) <- rownames(group_mat) %>% str_extract("(?<=^group).+")
msum <- group_mat %*% m
The result matches base::rowsum() on the dense version of the matrix,
d <- as.matrix(m)
dsum <- rowsum(d, group, reorder = FALSE)
all.equal(as.matrix(msum), dsum)
#> [1] TRUE
but the sparse-matrix multiplication method is much faster,
bench::mark( msum <- group_mat %*% m )$median
#> [1] 344µs
bench::mark( dsum <- rowsum(d, group) )$median
#> [1] 146ms
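To see why the multiplication reproduces rowsum(), here is a minimal dense sketch in base R only. The small matrix d and the grouping factor are made up for illustration; model.matrix() plays the role that sparse.model.matrix() plays above.

```r
# Small dense example: 6 rows, 2 columns
d <- matrix(as.numeric(1:12), nrow = 6,
            dimnames = list(paste0("row", 1:6), c("col1", "col2")))
group <- factor(c("x", "y", "x", "z", "y", "x"))

# 3 x 6 indicator matrix: entry (g, i) is 1 if row i belongs to group g
group_mat <- t(model.matrix(~ 0 + group))

by_matmul <- group_mat %*% d   # each output row adds up the rows of one group
by_rowsum <- rowsum(d, group)  # base R reference result, rows in level order
```

Each row of group_mat selects the rows of d that belong to one group, so the product adds them up. With a sparse indicator matrix and a sparse m, only the non-zero entries take part in the product.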

Extract information from one matrix through another matrix

I have 2 matrices: one is species x traits and the second one is site x species (presence/absence). I need a third matrix, sites x traits, where each cell combines the values of all species present at that site. How can I do this, i.e. extract information from one matrix through another matrix? I am just a beginner in R...
I transposed the site x species matrix and used cbind on the 2 matrices, but the result was all columns in one matrix...
trait <- read.table("trait_matrix_final.txt", head=T, sep="\t", dec=',', row.names=1)
com <- read.table("community_matrix2.txt", head=T, sep="\t", dec=',', row.names=1)
comt <- t(com)
new <- cbind(trait, comt)
And I tried to multiply both matrices, but it is not possible because I have continuous and categorical data.
EDIT:
Additional details: I have continuous variables (e.g. body size) and categorical variables (e.g. daily activity, with the values nocturnal, diurnal or both). So, if I have 3 species in site 1, I want to obtain the mean body size of these 3 species for site 1. For the categorical variable, if the 3 species have the values species 1 = nocturnal, species 2 = nocturnal and species 3 = diurnal, the entry will be something like nocturnal+diurnal or nocturnal.diurnal. My third matrix will have the same number of columns as the 1st matrix (species x traits), but the traits are summarized across all species present at each site.
It would be very useful to provide a reproducible example so SO community can help you in solving the problem.
AFTER EDIT:
You should store the data in an object of class matrix only if all entries of that matrix are of the same class (e.g. all numeric or all character). Because your first matrix has both numeric and character values it is better to format it as a data.frame. See this post for more info.
I will generate some data assuming you have 5 traits per species, 20 species per site, and 10 sites:
n.traits <- 5
n.species <- 20
n.sites <- 10
traits.names <- paste ("trait", 1:n.traits, sep = "_")
species.names <- paste ("spec", 1:n.species, sep = "_")
sites.names <- paste ("site", 1:n.sites, sep = "_")
# species*traits matrix
set.seed (4)
mat1 <- as.data.frame (matrix (replicate (n = n.traits, rnorm (n = n.species)), nrow = n.species, ncol = n.traits, dimnames = list (species.names, traits.names)))
mat1
set.seed (89)
mat1[, 2] <- sample (x = c ("diurnal", "nocturnal"), size = nrow (mat1), replace = T)
mat1
# site*species matrix
set.seed (6)
mat2 <- matrix (replicate (n = n.species, rbinom (n = n.sites, size = 1, prob = 0.8)), nrow = n.sites, ncol = n.species, dimnames = list (sites.names, species.names))
mat2
The following for loop averages traits across species for each site:
# sites*traits matrix
mat3 <- as.data.frame (matrix (NA, nrow = n.sites, ncol = n.traits, dimnames = list (sites.names, traits.names)))
for (i in 1:n.sites){
spec_per_site_boolean <- mat2[i, ] == 1
mat1_subset <- mat1[spec_per_site_boolean, ]
for (j in 1:n.traits){
if (is.numeric (mat1_subset[,j]))
mat3[i,j] <- mean (mat1_subset[,j])
else
mat3[i,j] <- paste (sort (unique(mat1_subset[,j])), collapse = ".")
}
}
mat3
Note that the third matrix has the same number of columns as the first one (e.g. ncol (mat1) == ncol (mat3)), but it doesn't have the same number of rows (e.g. nrow (mat1) != nrow (mat3)).
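The same per-site summary can be written without explicit index loops. This is a minimal sketch on a tiny hand-made data set; the objects traits and com and the helper summarise_trait are hypothetical names introduced here, not taken from the question's files.

```r
# Species x traits: one numeric and one categorical trait
traits <- data.frame(
  body_size = c(10, 20, 30),
  activity  = c("nocturnal", "nocturnal", "diurnal"),
  row.names = c("sp1", "sp2", "sp3")
)
# Site x species presence/absence
com <- matrix(c(1, 1, 1,
                1, 0, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("site1", "site2"), rownames(traits)))

# Mean for numeric traits, sorted unique values joined by "." otherwise
summarise_trait <- function(v) {
  if (is.numeric(v)) mean(v) else paste(sort(unique(v)), collapse = ".")
}

# One summary row per site, built from the species present at that site
site_traits <- do.call(rbind, lapply(rownames(com), function(s) {
  present <- traits[com[s, ] == 1, , drop = FALSE]
  as.data.frame(lapply(present, summarise_trait), row.names = s)
}))
site_traits
```

Keeping the result as a data frame lets numeric and character columns coexist, which is the same reason the answer stores the traits in a data frame.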

Spearman correlation between two matrices of same dimensions

I have two matrices of equal dimensions (p and e) and I would like to make a spearman correlation between columns of the same name. I want to have the output of pair correlations in a matrix (M)
I used the corr.test() function from library Psych and here is what I did:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(p[,rs],e[,rs],method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong? or suggest another method?
No need for all this looping and indexing etc:
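# test data (column names added so that colnames(p) is not NULL)
p <- matrix(data = rnorm(100),nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(data = rnorm(100),nrow = 10, dimnames = list(NULL, letters[1:10]))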
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the names of the matrices don't already match, then match them:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, running corr.test() on all possible pairs would take a very long time, but the loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character(M[,1])  # make the first column character so names can be stored
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(as.data.frame(p[,rs]),as.data.frame(e[,rs]),
method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
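When only the matching column pairs are needed, the full cross-correlation matrix can be avoided altogether. The following is a minimal sketch that swaps psych::corr.test() for base R's cor.test(); the simulated matrices and the column names are assumptions made for illustration.

```r
set.seed(1)
p <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))

# One Spearman test per shared column name; returns a 10 x 2 matrix
res <- t(sapply(colnames(p), function(nm) {
  ct <- cor.test(p[, nm], e[, nm], method = "spearman")
  c(r = unname(ct$estimate), p = ct$p.value)
}))

M <- data.frame(name = rownames(res), res, row.names = NULL)
head(M)
```

Indexing e by column name rather than by position means the two matrices do not need to have their columns in the same order.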

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample size for each correlation is likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code for how you are estimating the correlation, so we can give exact help.
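For a Pearson correlation, the degrees of freedom reported by cor.test() are n - 2, where n is the number of complete pairs. A minimal sketch with made-up vectors shows how the two numbers relate:

```r
x <- c(1:10, NA, 12)
y <- c(2, 1, NA, 4, 6, 5, 8, 7, 10, 9, 11, 12)

# Number of observations where both x and y are present
n_pairs <- sum(complete.cases(x, y))

# cor.test() drops incomplete pairs itself; its df is n_pairs - 2
ct <- cor.test(x, y)
n_from_df <- unname(ct$parameter) + 2
```

Here positions 3 and 11 each have one missing value, so 10 of the 12 pairs are used.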
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() that result. On my Databricks cluster it cut the compute time for a 1865 row x 69 column matrix from 2.5-3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
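The pairwise counts can also be obtained with a single matrix product instead of looping over column pairs. This is a minimal sketch, not taken from the answers above; pairwise_n is a hypothetical name. It uses the fact that the cross product of a 0/1 "is observed" matrix counts, for each pair of columns, the rows where both are present.

```r
pairwise_n <- function(mat) {
  ok <- !is.na(mat)               # TRUE where a value is observed
  storage.mode(ok) <- "double"    # 0/1 matrix, dim and dimnames kept
  crossprod(ok)                   # t(ok) %*% ok: n x n matrix of pairwise counts
}

mat <- cbind(x1 = c(1, NA, 3, 4),
             x2 = c(NA, 2, 3, 4),
             x3 = c(1, 2, 3, NA))
n <- pairwise_n(mat)
n
```

The diagonal holds the number of non-missing values per column, and the off-diagonal entries are the sample sizes that a pairwise-complete correlation would use.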
