R, compute index based on co-occurrence matrix

I have a matrix of species occurring in sites and I want to compute the following formula for each pair ab of species:
$$C_{ab} = \frac{(R_a - S)(R_b - S)}{R_a R_b}$$
where Ra and Rb are the numbers of occurrences of species a and b respectively, and S is the number of sites where a and b co-occur.
So far, I have this solution which is very slow (actually way too slow for my matrix):
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
diag(S) <- 0
# Data frame with all the pair combinations
comb <- data.frame(t(combn(colnames(mat_bin), 2)))
colnames(comb) <- c("sp1", "sp2")
comb$Cscore <- 0
# Slow for_loop to compute the Cscore of each pair
for(i in 1:nrow(comb)){
num <- (nbocc[[comb[i, "sp1"]]] - S[comb[i, "sp1"], comb[i, "sp2"]]) *
(nbocc[[comb[i, "sp2"]]] - S[comb[i, "sp1"], comb[i, "sp2"]])
denom <- nbocc[[comb[i, "sp1"]]] * nbocc[[comb[i, "sp2"]]]
comb[i, "Cscore"] <- num/denom
}
A first solution could be to parallelize the for-loop, but maybe a more efficient solution exists.

As you already did for S, you can do the whole calculation in a vectorized manner with matrices.
It would look as follows:
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
resMat <- (nbocc - S) * t(nbocc - S) /
outer(nbocc, nbocc, `*`)
# in the end you would need just the triangle
resMat[lower.tri(resMat)]
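
As a sanity check, here is a minimal sketch that attaches the vectorized scores to the pair table and compares them with a pair-by-pair lookup. It assumes the same example data as above; the name `ref` and the `mapply` reference are only for illustration. Because `resMat` is symmetric, its lower triangle read column by column should come out in the same order as the pairs produced by `combn`.

```r
set.seed(1)
# Binary site x species matrix, as in the example above
mat_bin <- matrix(as.numeric(runif(200) > 0.5), ncol = 20)
colnames(mat_bin) <- paste0("sp_", seq_len(ncol(mat_bin)))

nbocc <- colSums(mat_bin)      # occurrences per species
S <- crossprod(mat_bin)        # co-occurrences per species pair

# Vectorized score for every pair at once
resMat <- (nbocc - S) * t(nbocc - S) / outer(nbocc, nbocc)

# Pair table in combn order, filled from the lower triangle
comb <- data.frame(t(combn(colnames(mat_bin), 2)), stringsAsFactors = FALSE)
colnames(comb) <- c("sp1", "sp2")
comb$Cscore <- resMat[lower.tri(resMat)]

# Reference values computed one pair at a time (illustration only)
ref <- mapply(function(a, b) {
  (nbocc[[a]] - S[a, b]) * (nbocc[[b]] - S[a, b]) / (nbocc[[a]] * nbocc[[b]])
}, comb$sp1, comb$sp2)

same <- isTRUE(all.equal(comb$Cscore, unname(ref)))
```

If `same` is `TRUE`, the loop can be dropped entirely; no `diag(S) <- 0` step is needed because the diagonal is never read.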

Related

Plot cumulative value for different series

I have run a short simulation and want to plot the outcomes of each simulation in terms of the "running sum" over parameter k. For reference, I want to end up with a plot that looks similar to the ones in this article:
https://www.pinnacle.com/en/betting-articles/Betting-Strategy/betting-bankroll-management/VDM2GY6UX3B552BG
The following is the code for the simulation:
## Simulating returns over k bets.
odds <- 1.5
k <- 100
return <- odds - 1
edge <- 0.04
pw <- 1/(odds/(1-edge))
pl <- 1-pw
nsims <- 10000
set.seed(42)
sims <- replicate(nsims, {
x <- sample(c(-1,return), k, TRUE, prob=c(pl, pw))
})
rownames(sims) <- c(1:k)
colnames(sims) <- c(1:nsims)
If I wasn't being clear in the description let me know.

Here is how you can plot the cumulative value over bets (I set nsims <- 10 so that the plot is readable).
First I generate the data:
## Simulating returns over k bets.
odds <- 1.5
k <- 100
return <- odds - 1
edge <- 0.04
pw <- 1/(odds/(1-edge))
pl <- 1-pw
nsims <- 10
set.seed(42)
sims <- replicate(nsims, {
x <- sample(c(-1,return), k, TRUE, prob=c(pl, pw))
})
rownames(sims) <- c(1:k)
colnames(sims) <- c(1:nsims)
Then I create a data frame containing the results of the n simulations (10 here):
df <- as.data.frame(sims)
What we want to plot is the cumulative sum, not the result at a specific bet, so we iterate over the columns (i.e. the simulations) and replace each one with its running total:
for (i in colnames(df)){
df[[i]] <- cumsum(df[[i]])
}
# mutate() comes from dplyr, melt() from reshape2, ggplot() from ggplot2
library(dplyr)
library(reshape2)
library(ggplot2)
# Store the bet number as a number so the x axis is ordered 1, 2, ..., k
df <- mutate(df, bets = seq_len(nrow(df)))
output <- melt(df, id.vars = "bets", variable.name = 'simulation')
Now we can plot our data:
ggplot(output, aes(bets,value,group=simulation)) + geom_line(aes(colour = simulation))
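
For comparison, a minimal base R sketch of the same idea, assuming no extra packages are wanted: `apply(sims, 2, cumsum)` computes the running sum of every simulation in one call and `matplot` draws one line per column. The variable `ret` is a renamed stand-in for `return` in the question (chosen here only to avoid shadowing the base function), and `running` is a name introduced for illustration.

```r
# Simulation parameters as in the question
odds <- 1.5
k <- 100
ret <- odds - 1                 # payoff of a winning bet
edge <- 0.04
pw <- 1 / (odds / (1 - edge))   # probability of winning
pl <- 1 - pw                    # probability of losing
nsims <- 10

set.seed(42)
sims <- replicate(nsims, sample(c(-1, ret), k, TRUE, prob = c(pl, pw)))

# Running sum down each column: a k x nsims matrix
running <- apply(sims, 2, cumsum)

# One line per simulation
matplot(seq_len(k), running, type = "l", lty = 1,
        xlab = "bets", ylab = "cumulative return")
abline(h = 0, col = "grey")
```

The last row of `running` is the final bankroll change of each simulation, which gives a quick way to check the result against `colSums(sims)`.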

R: Stuck on a "simple" problem: calculating total sum of squares in a n*m matrix

Given a data matrix with n rows and m columns, I would like to calculate the total sum of squares in R.
For this I've tried a loop that iterates through the rows of each column and saves the results in a vector. These are then added to the "TSS" vector where each value is the SS of one column. The sum of this vector should be the TSS.
set.seed(2020)
m <- matrix(c(sample(1:100, 80)), nrow = 40, ncol = 2)
tss <- c()
for(j in 1:ncol(m)){
tssVec <- c()
for(i in 1:nrow(m)){
b <- sum(((m[i,]) - mean(m[,j]))^2)
tssVec <- c(tssVec, b)
}
tss <- c(tss, sum(tssVec))
}
sum(tss)
The output is equal to 136705.6. This does not look plausible at all. As a novice coder, I am unfortunately stuck.
Any help is appreciated!

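There are many ways to evaluate the TSS, and they all give the same result. I would do something like the following. Note that both snippets give the total sum of squares of the first column (V1); repeat for the other columns and add the results to get the total for the whole matrix.
Method 1, using ANOVA:
n <- as.data.frame(m)
mylm <- lm(V1 ~ V2, data = n)
# The Sum Sq column (regression + residual) adds up to the total SS of V1
SSTotal <- sum(anova(mylm)[, 2])
Method 2:
# The sample variance is SS / (n - 1), so multiply back by n - 1
SSTotal <- var(m[, 1]) * (nrow(m) - 1)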
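
A minimal sketch for the whole matrix, assuming "total sum of squares" means the squared deviations of every entry from its own column mean, summed over all columns. The loop in the question appears to subtract the column mean from the entire row `m[i, ]` rather than from the single entry `m[i, j]`, which would inflate the result; the names `tss_loop`, `tss_centered` and `tss_var` are introduced here for illustration.

```r
set.seed(2020)
m <- matrix(sample(1:100, 80), nrow = 40, ncol = 2)

# Corrected loop: one entry m[i, j] at a time
tss_loop <- 0
for (j in seq_len(ncol(m))) {
  col_mean <- mean(m[, j])
  for (i in seq_len(nrow(m))) {
    tss_loop <- tss_loop + (m[i, j] - col_mean)^2
  }
}

# Vectorized: centre each column, square, and sum everything
tss_centered <- sum(scale(m, center = TRUE, scale = FALSE)^2)

# Same quantity from the per-column variances
tss_var <- sum(apply(m, 2, var)) * (nrow(m) - 1)
```

All three values should agree up to floating-point error, so either one-liner can replace the nested loop.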

Euclidean distance for each row in dataset

Suppose we have dataset G2:
data(iris)
G2 <- iris[1:5, -5]
We need to calculate the Euclidean distance between x (a row in G2) and G2 (excluding x) for all x's in G2, i.e. the average distance from x to each of the other rows.
I wonder what the best way to do this is. Here is my initial attempt:
D <- dist(G2)
m1 <- as.matrix(D)
(1 / (5 - 1)) * colSums(m1)

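Your notation is a bit confusing because you use D differently in the code and in the formula. How about
# Full distance matrix; the diagonal (distance of a row to itself) is 0
m <- as.matrix(dist(G2, upper = TRUE))
# Column means divide by n, including the zero on the diagonal
D <- apply(m, 2, mean)
n <- length(D)
# Rescale so that the average is taken over the n - 1 other rows
D <- n/(n-1)*D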
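
A short sketch showing that the rescaled column mean is the same as dividing the row sums of the distance matrix by n - 1, which is what the attempt in the question does. It uses the built-in `iris` data; the names `avg_dist` and `alt` are introduced here for illustration.

```r
G2 <- iris[1:5, -5]

# Pairwise Euclidean distances as a full symmetric matrix
m <- as.matrix(dist(G2))
n <- nrow(m)

# Average distance from each row to the other n - 1 rows
avg_dist <- rowSums(m) / (n - 1)

# Rescaled column mean, as in the answer above
alt <- n / (n - 1) * apply(m, 2, mean)
```

Since the diagonal of `m` is zero, both expressions sum the same n - 1 distances, so `avg_dist` and `alt` should be equal.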

Speeding up this tricky matrix calculation

As of now I am computing some features from a large matrix and doing it all in a for-loop. As expected it's very slow. I have been able to vectorize part of the code, but I'm stuck on one part.
I would greatly appreciate some advice/help!
s1 <- MyMatrix #dim = c(5167,256)
fr <- MyVector #vector of length 256
tw <- 5
fw <- 6
# For each point S(t,f) we need the sub-matrix of points S_hat(i,j),
# i in [t - tw, t + tw], j in [f - fw, f + fw] for the feature vector.
# To avoid edge effects, I pad the original matrix with zeros,
# resulting in a matrix of size nobs+2*tw x nfreqs+2*fw
nobs <- dim(s1)[1] #note: this is 5167
nf <- dim(s1)[2] #note: this is 256
sp <- matrix(0, nobs+2*tw, nf+2*fw)
t1 <- tw+1; tn <- nobs+tw
f1 <- fw+1; fn <- nf+fw
sp[t1:tn, f1:fn] <- s1 # embed the actual matrix into the padding
nfeatures <- 1 + (2*tw+1)*(2*fw+1) + 1
fsp <- array(NaN, c(dim(sp),nfeatures))
for (t in t1:tn){
for (f in f1:fn){
fsp[t,f,1] <- fr[(f - f1 + 1)] #this part I can vectorize
fsp[t,f,2:(nfeatures-1)] <- as.vector(sp[(t-tw):(t+tw),(f-fw):(f+fw)]) #this line is the problem
fsp[t,f,nfeatures] <- var(fsp[t,f,2:(nfeatures-1)])
}
}
fsp[t1:tn, f1:fn, 1] <- t(matrix(rep(fr,(tn-t1+1)),ncol=(tn-t1+1)))
#vectorized version of the first feature ^
return(fsp[t1:tn, f1:fn, ]) #this is the returned matrix

I assume that the var feature will be easy to vectorize after the 2nd feature is vectorized
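
One possible way to vectorize the second feature, sketched under assumptions: instead of looping over every point (t, f), loop over the (2*tw+1)*(2*fw+1) window offsets and copy a shifted copy of the padded matrix into one feature slice per offset. The small random `s1` and `fr` below are hypothetical stand-ins for `MyMatrix` and `MyVector`, and `out`, `win`, `mu` are names introduced here. The column offset runs in the outer loop so that the slices follow the column-major order of `as.vector()`.

```r
set.seed(1)
# Hypothetical stand-ins for MyMatrix and MyVector
s1 <- matrix(rnorm(12 * 8), nrow = 12, ncol = 8)
fr <- seq_len(ncol(s1))
tw <- 2
fw <- 3

nobs <- nrow(s1)
nf <- ncol(s1)
sp <- matrix(0, nobs + 2 * tw, nf + 2 * fw)   # zero-padded copy
t1 <- tw + 1; tn <- nobs + tw
f1 <- fw + 1; fn <- nf + fw
sp[t1:tn, f1:fn] <- s1

nwin <- (2 * tw + 1) * (2 * fw + 1)
nfeatures <- 1 + nwin + 1

# Result without the padding: nobs x nf x nfeatures
out <- array(NaN, c(nobs, nf, nfeatures))

# Feature 1: the frequency value, repeated down every row
out[, , 1] <- matrix(fr, nrow = nobs, ncol = nf, byrow = TRUE)

# Features 2..(nfeatures - 1): one shifted copy of sp per window offset
k <- 1
for (dj in -fw:fw) {        # column offset, varies slowest
  for (di in -tw:tw) {      # row offset, varies fastest
    k <- k + 1
    out[, , k] <- sp[(t1:tn) + di, (f1:fn) + dj]
  }
}

# Last feature: sample variance across the window slices
win <- out[, , 2:(nfeatures - 1), drop = FALSE]
mu <- rowMeans(win, dims = 2)
out[, , nfeatures] <- rowSums((win - as.vector(mu))^2, dims = 2) / (nwin - 1)
```

This replaces nobs * nf inner iterations with nwin matrix copies. With the sizes given in the question (5167 x 256, tw = 5, fw = 6, so 145 features) the result array alone holds about 192 million doubles, roughly 1.5 GB, so memory rather than speed may become the limit.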

Linear Independence of Large Sparse Matrices in R

I have three large matrices: I, G, and G^2. These are sparse 4 million x 4 million matrices. I would like to check whether they are linearly independent, and I would like to do this in R.
For small matrices, a way to do this is to vectorize each matrix (stack its columns on top of each other) and test whether the matrix formed by the three stacked vectors has rank three.
However, due to the size of my problem I am not sure how to proceed.
(1) Is there a way to vectorize a large sparse matrix into a very large sparse vector in R?
(2) Is there any other solution to the problem that could make this test efficient?
Thanks in advance

When converting your matrices to vectors, you can keep only the non-zero elements.
# Sample data
n <- 4e6
k <- n
library(Matrix)
I <- spMatrix(n, n, 1:n, 1:n, rep(1,n))
G <- spMatrix(n, n,
sample(1:n, k, replace=TRUE),
sample(1:n, k, replace=TRUE),
sample(0:9, k, replace=TRUE)
)
G2 <- G %*% G
G2 <- as(G2, "dgTMatrix") # For the j slot
# Only keep elements that are non-zero in one of the 3 matrices
i <- as.integer( c(G@i, G2@i, I@i) + 1 )
j <- as.integer( c(G@j, G2@j, I@j) + 1 )
ij <- cbind(i,j)
rankMatrix( cbind( G2[ij], G[ij], I[ij] ) ) # 3
# Another example
m <- ceiling(n/2)-1
G <- spMatrix(n, n,
c(1:n, 2*(1:m)),
c(1:n, 2*(1:m)+1),
rep(1, n+m)
)
G2 <- as(G %*% G, "dgTMatrix")
i <- c(G@i, G2@i, I@i) + 1
j <- c(G@j, G2@j, I@j) + 1
ij <- cbind(i,j)
rankMatrix( cbind( G2[ij], G[ij], I[ij] ) ) # 2
(To speed things up, you could take only a small part of those vectors:
if the rank is already 3, you know that they are independent,
if it is 2, you can check if the linear dependence relation also holds for the large vectors.)
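
The same idea on a small scale, as a sketch that assumes the Matrix package is installed: `vec_rank` is a hypothetical helper (not part of Matrix) that collects the positions where at least one matrix is non-zero via `summary()`, reads every matrix at those positions, and computes the rank of the resulting columns. The two test matrices are chosen so that the answer is known in advance: a shift matrix, whose powers have disjoint supports, and the identity plus a nilpotent part N with N^2 = 0, for which G^2 = 2G - I.

```r
library(Matrix)

# Hypothetical helper: rank of several sparse matrices viewed as vectors,
# restricted to the union of their non-zero positions
vec_rank <- function(...) {
  mats <- list(...)
  ij <- unique(do.call(rbind, lapply(mats, function(M) {
    s <- summary(M)          # data frame with 1-based i, j and value x
    cbind(s$i, s$j)
  })))
  cols <- sapply(mats, function(M) M[ij])   # one column per matrix
  as.integer(rankMatrix(cols))
}

n <- 50
I <- sparseMatrix(i = 1:n, j = 1:n, x = 1, dims = c(n, n))

# Independent case: ones on the first superdiagonal
G_ind <- sparseMatrix(i = 1:(n - 1), j = 2:n, x = 1, dims = c(n, n))
r_ind <- vec_rank(I, G_ind, G_ind %*% G_ind)

# Dependent case: identity plus entries at (2k, 2k + 1)
m <- ceiling(n / 2) - 1
G_dep <- sparseMatrix(i = c(1:n, 2 * (1:m)), j = c(1:n, 2 * (1:m) + 1),
                      x = 1, dims = c(n, n))
r_dep <- vec_rank(I, G_dep, G_dep %*% G_dep)
```

Positions where all three matrices are zero contribute only zero rows, so dropping them does not change the rank; that is what keeps the problem tractable at the 4 million scale.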
