Suppose we have dataset G2:
data(iris)
G2 <- iris[1:5, -5]
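We need to calculate the Euclidean distance between x (a row of G2) and G2 (excluding x) for every x in G2, formally
D(x) = 1/(n - 1) * sum over y in G2, y != x, of d(x, y)
where n is the number of rows of G2 and d is the Euclidean distance.
I wonder what the best way to do this is. Here is my initial attempt: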
D <- dist(G2)
m1 <- as.matrix(D)
(1 / (5 - 1)) * colSums(m1)
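Your notation is a bit confusing because you use D differently in the code and in the formula. How about:
m <- as.matrix(dist(G2, upper=TRUE))
D <- apply(m, 2, mean)   # column means, including the zero diagonal
n <- length(D)
D <- n/(n-1)*D           # rescale so each mean is over the other n-1 rows only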
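As a sanity check, here is a minimal sketch that compares the distance-matrix approach with an explicit loop that drops each row in turn. The names mean_dist_loop and mean_dist_vec are made up for this illustration.

```r
G2 <- iris[1:5, -5]
n <- nrow(G2)

# Explicit version: for each row x, average its distance to the other rows
mean_dist_loop <- sapply(seq_len(n), function(i) {
  x <- unlist(G2[i, ])
  others <- as.matrix(G2[-i, ])
  mean(sqrt(rowSums(sweep(others, 2, x)^2)))
})

# Vectorized version: the diagonal of the distance matrix is 0,
# so dividing the row sums by n - 1 leaves x itself out of the average
mean_dist_vec <- rowSums(as.matrix(dist(G2))) / (n - 1)

all.equal(unname(mean_dist_vec), mean_dist_loop)
```

Because the distance matrix is symmetric, rowSums and colSums give the same totals here.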
I have a matrix of species occurring in sites and I want to compute the following formula for each pair ab of species:
C_ab = (Ra - S)(Rb - S) / (Ra * Rb)
where Ra and Rb are the numbers of occurrences of species a and b respectively, and S is the number of sites where a and b co-occur.
So far, I have this solution which is very slow (actually way too slow for my matrix):
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
diag(S) <- 0
# Data frame with all the pair combinations
comb <- data.frame(t(combn(colnames(mat_bin), 2)))
colnames(comb) <- c("sp1", "sp2")
comb$Cscore <- 0
# Slow for_loop to compute the Cscore of each pair
for(i in 1:nrow(comb)){
  num <- (nbocc[[comb[i, "sp1"]]] - S[comb[i, "sp1"], comb[i, "sp2"]]) *
    (nbocc[[comb[i, "sp2"]]] - S[comb[i, "sp1"], comb[i, "sp2"]])
  denom <- nbocc[[comb[i, "sp1"]]] * nbocc[[comb[i, "sp2"]]]
  comb[i, "Cscore"] <- num/denom
}
A first solution could be to parallelize the for-loop, but maybe a more optimized solution exists.
Just as you already did for S, you can do the whole calculation in a vectorized manner on matrices.
It would look as follows:
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
resMat <- (nbocc - S) * t(nbocc - S) /
outer(nbocc, nbocc, `*`)
# in the end you would need just the triangle
resMat[lower.tri(resMat)]
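To check that the matrix expression reproduces the pair-by-pair formula, the following sketch recomputes the score for every pair in combn() order and compares it with the lower triangle. The names pairs and loop_score are introduced here only for the comparison.

```r
set.seed(1)
mat_bin <- (matrix(runif(200), ncol = 20) > 0.5) * 1
colnames(mat_bin) <- paste0("sp_", seq_len(ncol(mat_bin)))

nbocc <- colSums(mat_bin)   # occurrences per species
S <- crossprod(mat_bin)     # co-occurrences per pair

# Vectorized score for all pairs at once
resMat <- (nbocc - S) * t(nbocc - S) / outer(nbocc, nbocc)

# Pair-by-pair reference, in the same order as combn()
pairs <- combn(ncol(mat_bin), 2)
loop_score <- apply(pairs, 2, function(p) {
  a <- p[1]; b <- p[2]
  (nbocc[a] - S[a, b]) * (nbocc[b] - S[a, b]) / (nbocc[a] * nbocc[b])
})

all.equal(unname(loop_score), resMat[lower.tri(resMat)])
```

The lower triangle is read column by column, which is the same order in which combn() lists the pairs, so the vector can be assigned directly to a column such as comb$Cscore.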
Let's say I have n igraph objects g1, g2, ..., gn. They are undirected, weighted graphs. I'd like to take the union of the n graphs as a single weighted graph g, i.e. g should get a new weight attribute.
According to the documentation (see ?graph.union), if the n graphs have the weight attribute, it is renamed by adding a _1, _2 (and _3, etc.) suffix, i.e. weight_1, weight_2, ..., weight_n.
I have seen the answer and wrote the code for n=3 graphs (see below).
Edited:
library(igraph)
rm(list=ls(all=TRUE)) # delete all objects
g1 <- graph_from_literal(A1-B1-C1)
g2 <- graph_from_literal(A2-B2-C2)
g3 <- graph_from_literal(A3-B3-C3)
E(g1)$weight <- c(1, 2)
E(g2)$weight <- c(3, 4)
E(g3)$weight <- c(5, 6)
g <- union(g1, g2, g3)
new_attr <- as.list(list.edge.attributes(g))
k <- length(new_attr) # number of new attributes
value_new_attr <- lapply(list.edge.attributes(g),
function(x) get.edge.attribute(g,x))
df <- data.frame()
for (i in 1:k) {df <- rbind(df, value_new_attr[[i]])}
E(g)$weight <- colSums(df, na.rm=TRUE)
g <- delete_edge_attr(g, "weight_1") # 1
g <- delete_edge_attr(g, "weight_2") # 2
g <- delete_edge_attr(g, "weight_3") # 3
Question. How can I rewrite the last three commands with the lapply() function?
My attempt does not work:
g <- lapply(value_new_attr, function(x) {g <- delete_edge_attr(g, x)})
I have found a solution with a for-loop:
# delete edge attributes with suffix
for (i in 1:k) {g <- delete_edge_attr(g, new_attr[[i]])}
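A possible reason the lapply() attempt fails is that lapply() returns a list with one result per element instead of carrying the modified graph from one call to the next, and it was given the attribute values rather than the attribute names. One way to avoid the explicit loop is Reduce(), which passes the graph returned by each deletion into the next one. The sketch below uses two graphs and assumes the suffixed attributes all start with "weight_"; the names suffixed and w are made up here.

```r
library(igraph)

g1 <- graph_from_literal(A1-B1-C1)
g2 <- graph_from_literal(A2-B2-C2)
E(g1)$weight <- c(1, 2)
E(g2)$weight <- c(3, 4)
g <- union(g1, g2)

# Names of the suffixed attributes created by union(), e.g. "weight_1"
suffixed <- grep("^weight_", edge_attr_names(g), value = TRUE)

# One column per suffixed attribute, one row per edge; missing values count as 0
w <- sapply(suffixed, function(a) edge_attr(g, a))
E(g)$weight <- rowSums(w, na.rm = TRUE)

# Reduce() threads the graph through successive deletions, starting from g
g <- Reduce(function(gr, a) delete_edge_attr(gr, a), suffixed, g)

edge_attr_names(g)
```

The same Reduce() call works for any number of graphs, because the list of attribute names is built from the graph itself.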
As of now I am computing some features from a large matrix and doing it all in a for-loop. As expected it's very slow. I have been able to vectorize part of the code, but I'm stuck on one part.
I would greatly appreciate some advice/help!
s1 <- MyMatrix #dim = c(5167,256)
fr <- MyVector #vector of length 256
tw <- 5
fw <- 6
# For each point S(t,f) we need the sub-matrix of points S_hat(i,j),
# i in [t - tw, t + tw], j in [f - fw, f + fw] for the feature vector.
# To avoid edge effects, I pad the original matrix with zeros,
# resulting in a matrix of size nobs+2*tw x nfreqs+2*fw
nobs <- dim(s1)[1] #note: this is 5167
nf <- dim(s1)[2] #note: this is 256
sp <- matrix(0, nobs+2*tw, nf+2*fw)
t1 <- tw+1; tn <- nobs+tw
f1 <- fw+1; fn <- nf+fw
sp[t1:tn, f1:fn] <- s1 # embed the actual matrix into the padding
nfeatures <- 1 + (2*tw+1)*(2*fw+1) + 1
fsp <- array(NaN, c(dim(sp),nfeatures))
for (t in t1:tn) {
  for (f in f1:fn) {
    fsp[t, f, 1] <- fr[f - f1 + 1] # this part I can vectorize
    fsp[t, f, 2:(nfeatures-1)] <- as.vector(sp[(t-tw):(t+tw), (f-fw):(f+fw)]) # this line is the problem
    fsp[t, f, nfeatures] <- var(fsp[t, f, 2:(nfeatures-1)])
  }
}
# vectorized version of the first feature (replaces the fsp[t,f,1] line in the loop)
fsp[t1:tn, f1:fn, 1] <- t(matrix(rep(fr, tn-t1+1), ncol = tn-t1+1))
return(fsp[t1:tn, f1:fn, ]) # this is the returned array
I assume that the var feature will be easy to vectorize once the second feature is vectorized.
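One way to vectorize the window feature is to loop over the (2*tw+1)*(2*fw+1) window offsets instead of over every point, copying a whole shifted block of the padded matrix at each step. The sketch below uses small random stand-ins for MyMatrix and MyVector (made-up data) and keeps the three feature groups in separate objects (feat_fr, win, feat_var, all hypothetical names) rather than in one array.

```r
set.seed(1)
s1 <- matrix(rnorm(8 * 6), 8, 6)   # stand-in for MyMatrix
fr <- seq_len(ncol(s1))            # stand-in for MyVector
tw <- 2; fw <- 1

nobs <- nrow(s1); nf <- ncol(s1)
sp <- matrix(0, nobs + 2 * tw, nf + 2 * fw)
t1 <- tw + 1; tn <- nobs + tw
f1 <- fw + 1; fn <- nf + fw
sp[t1:tn, f1:fn] <- s1             # embed the data in the zero padding
nwin <- (2 * tw + 1) * (2 * fw + 1)

# win[t, f, k] is the k-th element of as.vector() of the window around (t, f)
win <- array(NaN, c(nobs, nf, nwin))
k <- 0
for (dj in -fw:fw) {               # column offset: slow index of as.vector()
  for (di in -tw:tw) {             # row offset: fast index of as.vector()
    k <- k + 1
    win[, , k] <- sp[(t1:tn) + di, (f1:fn) + dj]
  }
}

# First feature: fr repeated along the time axis
feat_fr <- matrix(fr, nobs, nf, byrow = TRUE)

# Last feature: sample variance over each window, from sums and sums of squares
win_mean <- rowSums(win, dims = 2) / nwin
feat_var <- (rowSums(win^2, dims = 2) - nwin * win_mean^2) / (nwin - 1)
```

The offset loop runs 143 times for tw = 5 and fw = 6, regardless of the size of the matrix. The sum-of-squares variance is fast but can lose precision when values are large relative to their spread; apply(win, c(1, 2), var) is a slower alternative.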
I have three large matrices: I, G, and G^2. These are 4 million x 4 million matrices and they are sparse. I would like to check whether they are linearly independent, and I would like to do this in R.
For small matrices, a way to do this is to vectorize each matrix: stack the columns on top of each other and test whether the matrix formed by the three stacked vectors has rank three.
However, due to the size of my problem I am not sure how to proceed.
(1) Is there a way to vectorize a large sparse matrix into a very large sparse vector in R?
(2) Is there any other solution to the problem that could make this test efficient?
Thanks in advance
When converting your matrices to vectors, you can keep only the non-zero elements.
# Sample data
n <- 4e6
k <- n
library(Matrix)
I <- spMatrix(n, n, 1:n, 1:n, rep(1,n))
G <- spMatrix(n, n,
sample(1:n, k, replace=TRUE),
sample(1:n, k, replace=TRUE),
sample(0:9, k, replace=TRUE)
)
G2 <- G %*% G
G2 <- as(G2, "dgTMatrix") # For the j slot
# Only keep elements that are non-zero in one of the 3 matrices
i <- as.integer( c(G@i, G2@i, I@i) + 1 )
j <- as.integer( c(G@j, G2@j, I@j) + 1 )
ij <- cbind(i,j)
rankMatrix( cbind( G2[ij], G[ij], I[ij] ) ) # 3
# Another example
m <- ceiling(n/2)-1
G <- spMatrix(n, n,
c(1:n, 2*(1:m)),
c(1:n, 2*(1:m)+1),
rep(1, n+m)
)
G2 <- as(G %*% G, "dgTMatrix")
i <- c(G@i, G2@i, I@i) + 1
j <- c(G@j, G2@j, I@j) + 1
ij <- cbind(i,j)
rankMatrix( cbind( G2[ij], G[ij], I[ij] ) ) # 2
(To speed things up, you could take only a small part of those vectors:
if the rank is already 3, you know that they are independent,
if it is 2, you can check if the linear dependence relation also holds for the large vectors.)
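The reason only the non-zero positions are needed is that rows of the stacked matrix that are zero in all three columns do not affect its rank. A small dense sketch in base R illustrates this; the names I_small, G_small, G2_small, full and keep are made up, and qr() is used here in place of rankMatrix().

```r
set.seed(42)
n <- 6
I_small <- diag(n)
G_small <- matrix(rbinom(n * n, 1, 0.3) * sample(1:9, n * n, replace = TRUE), n, n)
G2_small <- G_small %*% G_small

# Vectorize each matrix (columns stacked) and bind the three vectors
full <- cbind(c(G2_small), c(G_small), c(I_small))

# Keep only positions where at least one of the three matrices is non-zero
keep <- rowSums(full != 0) > 0

rank_full <- qr(full)$rank
rank_kept <- qr(full[keep, , drop = FALSE])$rank
rank_full == rank_kept
```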
To sweep out colMeans, rowMeans and mean from columns, rows and observations respectively I use the following code:
a <- matrix(data=seq(from=2, to=60, by=2), nrow=6, ncol=5, byrow=FALSE)
b <- matrix(data=rep(colMeans(a), nrow(a)), nrow=nrow(a), ncol=ncol(a), byrow=TRUE)
c <- matrix(data=rep(rowMeans(a), ncol(a)), nrow=nrow(a), ncol=ncol(a), byrow=FALSE)
d <- matrix(data=rep(mean(a), nrow(a)*ncol(a)), nrow=nrow(a), ncol=ncol(a), byrow=FALSE)
e <- a-b-c-d
colMeans can be swept out with this command:
a1 <- sweep(a, 2, colMeans(a), "-")
Is there any single command to sweep out colMeans, rowMeans and mean? Thanks in advance.
What do you think e should look like in this example? Perhaps your line should be e <- a-b-c+d so that e has zero mean.
The following code produces the same result as that calculation using b, c, and d (with your arithmetic progression example, a matrix of 0s). Change + to - if you insist.
e <- t(t(a) - colMeans(a)) - rowMeans(a) + mean(a)
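To see what the + mean(a) variant does on data that is not an arithmetic progression, here is a short sketch on a random matrix (made-up data). It checks that the sweep() form and the one-line form agree and prints the largest remaining row and column means, which should be zero up to rounding error.

```r
set.seed(123)
a <- matrix(rnorm(30), nrow = 6, ncol = 5)

# Subtract the column means, then the row means of the original matrix,
# then add the grand mean back
e <- sweep(a, 2, colMeans(a), "-")
e <- sweep(e, 1, rowMeans(a), "-") + mean(a)

# The same in one line: rowMeans(a) is recycled down each column
e2 <- t(t(a) - colMeans(a)) - rowMeans(a) + mean(a)

all.equal(e, e2)
max(abs(colMeans(e)))
max(abs(rowMeans(e)))
```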
Not that I know of, but why not just write your own? It's only four lines:
meanSweep <- function(x){
  tmp <- sweep(x,2,colMeans(x),"-")
  tmp <- sweep(tmp,1,rowMeans(x),"-")
  tmp <- tmp - mean(x)
  tmp
}
all.equal(e,meanSweep(a))
[1] TRUE