Vectorizing function to calculate entropy - r

I'm writing a function to calculate a Shannon diversity index. I have wide data with the percent of observations for each value stored as separate variables, with each row representing a different site. I will have between 2 and 7 variables depending on the data set. For each row I want to calculate the information index.
I have a loop-based function, but it's quite slow, and I'm looking for help vectorizing it. I'm also happy with a tidyverse-style solution.
I've been looking at using the entropy package, but it seems to expect the data in long form, and while I could expand my data back out, that seems like it would be unnecessarily slow. I currently have 20k sites with hundreds to thousands of observations per site that have already been summarized into wide-format percent distributions. This related question similarly works with long-form data.
Example data
# Wide data: between 2 and 7 columns recording the percent of observations with each value (this example uses 3)
df <- data.frame(
  site = 1:3,
  l1 = c(.33, .5, 0),
  l2 = c(.33, .5, 0),
  l3 = c(.33, 0, 1)
)
Current loop function
entropy <- function(df, vars) {
  entropy_calc <- function(df, i, vars) {
    sum <- 0
    for (j in vars) {
      x <- df[i, j]
      if (x != 0) {  # skip zeros
        sum <- sum + x * log(x)
      }
    }
    return(-sum)
  }
  entropy <- rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    entropy[i] <- entropy_calc(df, i, vars)
  }
  return(as.numeric(entropy))
}
df$entropy <- entropy(df, 2:4)

This can be vectorized easily because the underlying functions are already vectorized. You don't need to manually skip zeros, because log(0) returns -Inf and 0*log(0) returns NaN; the NaN values are dropped when summing the cell values by specifying na.rm = TRUE.
entropy <- function(p) rowSums(-(p * log(p)), na.rm = TRUE)
entropy(df[,2:4])
Also check out the diversity() function in the vegan package which does essentially this, among other possibilities.
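For reference, here is a minimal sketch of the vegan route (assuming the package is installed). Note that diversity() first divides each row by its total, so it matches the rowSums() version whenever the percentages already sum exactly to 1:
# install.packages("vegan")  # if not already installed
library(vegan)
df$entropy <- diversity(df[, 2:4], index = "shannon")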

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix, because I only care about the correlation of LPS0 with all the other columns, not the correlations of the other columns with each other. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p-values are returned as NA, so I switched to if_else(). However, I am now getting an error. My code is as follows:
for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]],
                     method = "spearman")
  if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! `true` must be length 1 (length of `condition`), not 8.
Backtrace:
 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
library(tidyverse)

df <- select_if(mtcars, is.numeric)
glimpse(df)

# keep the real names
dict <- cbind(original = names(df), new = paste0("v", 1:ncol(df)))
# but change the names for better data viz
colnames(df) <- paste0("v", 1:ncol(df))

# correlate the variables and collect p-values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
  pvals[[coln]] <- map(df, ~ cor.test(df[, coln], .)$p.value)
  corss[[coln]] <- map(df, ~ cor(df[, coln], .))
}

# keep both matrices in a list
matrices <- list(
  pvalues = matrix(data = unlist(pvals),
                   ncol = length(names(pvals)),
                   nrow = length(names(pvals))),
  correlations = matrix(data = unlist(corss),
                        ncol = length(names(corss)),
                        nrow = length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)

# create a combined data frame
long_cors <- expand.grid(Var1 = names(df), Var2 = names(df)) %>%
  mutate(cor = unlist(matrices["correlations"]),
         pval = unlist(matrices["pvalues"]),
         same = Var1 == Var2,
         significant = pval < 0.05,
         dpcate = duplicated(cor)) %>%
  # drop duplicates, non-significant results and self-correlations
  filter(same == FALSE, significant == TRUE, dpcate == FALSE) %>%
  select(-c(same, dpcate, significant))

# plot the correlations
long_cors %>%
  mutate(negative = cor < 0) %>%
  ggplot(aes(x = Var1, y = Var2,
             color = negative, size = abs(cor), fill = Var2,
             label = round(cor, 2))) +
  geom_label(show.legend = FALSE, alpha = 0.2) +
  scale_color_manual(values = c("black", "darkred")) +
  # size each correlation label by its magnitude
  scale_size_area(seq(1, 100, length = length(unique(long_cors$Var1)))) +
  theme_light() +
  theme(axis.text = element_text(face = "bold", size = 12)) +
  labs(title = "Correlation between variables",
       caption = "p < 0.05") +
  xlab("") + ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
                       ~ cor.test(mtx[, 1], mtx[, .], use = "p", method = "s")$p.value)
Whenever you feel like you need a for loop in R, chances are you should be using another approach. for is a very un-R construct, and R offers many better ways of handling the same tasks. The map_*() family of functions from the tidyverse is but one of them. Another approach, in base R, is to use apply():
significant <- apply(mtx[, -1], 2,
                     \(x) cor.test(x, mtx[, 1], method = "s", use = "p")$p.value)
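As for the error in the original loop: dplyr::if_else() is a vectorized function that requires true, false and missing to match the length of the condition, and print(morm_0) returns the whole htest object (length 8), hence the failure. Since each cor.test() call yields a single p-value, a plain if/else plus an is.na() check is enough. A minimal sketch, assuming the asker's data frame and column layout:
for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[i]],
                     method = "spearman")
  if (is.na(morm_0$p.value)) {
    print("NA")                # handle the NA p-values explicitly
  } else if (morm_0$p.value <= 0.05) {
    print(morm_0)              # print only the significant results in full
  } else {
    print("Not Sig")
  }
}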

How can I make my for loop in R run faster? Can I vectorize this?

# Start: initialize values
# For each block length (BlockLengths) I will run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.
Index <- 0          # initializing Index
ThetaL <- 10        # number of estimations of Theta
Obs <- 50000        # sample size
Grp <- vector(length = 7)   # vector of the number of blocks; it depends on the block lengths (see L:15)
Theta <- matrix(data = 0, nrow = ThetaL, ncol = 7)  # matrix of the Theta estimates; there are 10 for each block length
BlockLengths <- c(10, 25, 50, 100, 125, 200, 250)   # setting the block lengths
for (r in BlockLengths) {
  Index <- Index + 1
  Grp[Index] <- Obs / r
  for (k in 1:ThetaL) {
    # Start: constructing the sample
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[1, ] <- runif(2, 0, 1)
    Y1[1, 1] <- -log(-(Y1[1, 1])^2 + 1)
    Y1[1, 2] <- -log(-(Y1[1, 2])^2 + 1)
    for (i in 2:Obs) {
      Y1[i, 1] <- Y1[i - 1, 2]
      Y1[i, 2] <- runif(1, 0, 1)
      Y1[i, 2] <- -log(-(Y1[i, 2])^2 + 1)
    }
    X1 <- vector(length = Obs)
    for (i in 1:Obs) {
      X1[i] <- max(Y1[i, ])
    }
    # End: constructing the sample
    K <- 0  # K counts the number of blocks with at least one exceedance
    for (t in 1:Grp[Index]) {                  # loop from 1 to the number of groups
      a <- 0
      for (j in (1 + r * (t - 1)):(t * r)) {   # loop over the sample within each group
        if (X1[j] > quantile(X1, 0.99)) {      # if a value exceeds the high threshold, add 1 to a
          a <- a + 1
        }
      }
      if (a >= 1) {   # if a is at least 1, the group had an exceedance
        K <- K + 1    # count the number of blocks with at least one exceedance
      }
    }
    N <- sum(X1 >= quantile(X1, 0.99))  # number of exceedances
    Theta[k, Index] <- (1 / r) * ((log(1 - K / Grp[Index])) / (log(1 - N / Obs)))  # estimate
    # Theta[k, Index] <- K / N
  }
}
I have been running the above code without errors and it took about 20 minutes, but I want to run the code for a larger sample and more repetitions, which makes the run time absurdly long. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it further, or should I use another programming language, given that I've read R is bad at for loops? Would vectorization help, and if so, how can I vectorize the code?
First, you can define BlockLengths before Grp and Theta, as both of them depend on its length:
Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))
Note: I decreased the size of the problem so that I could run it faster. With this specification, your original loop took 24.5 seconds.
Now, for the operation itself, there were three points I could improve:
Creation of Y1: the second column can be generated all at once by creating Obs random numbers with runif(). The first column is then just a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (a 12% improvement).
Creation of X1: you can apply the max function row-wise with apply(). This saved a further 1.5 seconds (6% improvement).
Calculation of K: for each t, you can take all the values X1[(1+r*(t-1)):(t*r)] and test the condition on them all at once (instead of using the second loop); any(...) does the same job as your a >= 1. Furthermore, you can remove the first loop by using lapply() and then summing the resulting logical vector, which yields the same result as your combination of if (a >= 1) and K = K + 1. The pipes (|>) are just for better visualization of the order of operations. This is by far the most important alteration, saving a further 18.4 seconds (75% improvement).
for (r in BlockLengths) {
  Index <- Index + 1
  Grp[Index] <- Obs / r
  for (k in 1:ThetaL) {
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[, 2] <- -log(-(runif(Obs))^2 + 1)
    Y1[, 1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs, 2])
    X1 <- apply(Y1, 1, max)
    K <- lapply(1:Grp[Index], function(t) {
      any(X1[(1 + r * (t - 1)):(t * r)] > quantile(X1, 0.99))
    }) |> unlist() |> sum()
    N <- sum(X1 >= quantile(X1, 0.99))
    Theta[k, Index] <- (1 / r) * ((log(1 - K / Grp[Index])) / (log(1 - N / Obs)))
  }
}
Using set.seed() I got the same results as your original loop.
A further possible improvement is to substitute the r and k loops with the purrr::map() family of functions.
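For illustration, here is a rough sketch of that idea (my own, untested against the original seed): wrap the per-repetition work in a helper, let map_dbl() and sapply() handle the k and r loops, and hoist the quantile() call out of the group loop, which does not change the result because X1 is fixed within a repetition:
library(purrr)

theta_for_block <- function(r, Obs) {
  Grp <- Obs / r
  Y2  <- -log(-(runif(Obs))^2 + 1)
  Y1  <- c(-log(-(runif(1))^2 + 1), Y2[-Obs])
  X1  <- pmax(Y1, Y2)            # element-wise max instead of apply()
  q99 <- quantile(X1, 0.99)      # compute the threshold once per repetition
  K   <- sum(map_lgl(1:Grp, ~ any(X1[(1 + r * (.x - 1)):(.x * r)] > q99)))
  N   <- sum(X1 >= q99)
  (1 / r) * (log(1 - K / Grp) / log(1 - N / Obs))
}

# one row per repetition, one column per block length, as in the original Theta
Theta <- sapply(BlockLengths, function(r) map_dbl(1:ThetaL, ~ theta_for_block(r, Obs)))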

Accumulation curves in R (not in Vegan)

I’d like to create accumulation curves, specifically metric accumulation curves, using bootstrapping and for loops. I’m interested in sampling (with replacement) the total number of plots in my example dataset, starting at 1 and working up to the total number (n=1 … max n). Each will be sampled 1000 times.
I don’t believe a package, such as Vegan, will help with this, since I’m not looking for species accumulation curves but instead need to calculate metrics based on abundance data and a plant species’ coefficient of conservatism (please correct me if I’m wrong about this!).
My example dataset is a matrix, with plots, plant species names, abundance values, and c-values (coefficients of conservatism):
https://docs.google.com/spreadsheets/d/1v-93sV4ANUXpObVbtixTo2ZQjiOKemvQ_cfubZPq9L4/edit?usp=sharing
For each of the 1000 iterations at each nth sample, I need to build a matrix that holds that iteration's results (species name, abundance, and c-value) and then eliminate any duplicate species from that sample. For each iteration, I must then calculate vegetation metrics. It's important that I don't calculate the metric across the entire 1000 iterations, but for each individual iteration.
I will repeat for n+1 until max n. At the end, ideally, I will then input those results into a matrix of my final results, with rows being n … max n, and 1000 columns with calculated metrics for each of those 1000 iterations. I will then average across iterations, and then create an accumulation curve of my desired metric from those averages.
The code that I thought was useful is included below, with a different example data set, including the metrics that I’m interested in calculating.
https://docs.google.com/spreadsheets/d/1GcH2aq3qZzgTv2YkN-uMpnShblgsuKxAPYKH_mLbbh8/edit?usp=sharing
d <- Example2
d <- data.matrix(d)

MEANC <- function(x) {
  mean(x, na.rm = TRUE)
}
FQI <- function(x) {
  mean(x, na.rm = TRUE) * sqrt(sum(!is.na(x)))
}
RICH <- function(x) {
  totalsprich <- sum(x)
  sum(x != 0, na.rm = TRUE)
}
shannon <- function(x) {
  totalCov <- sum(x, na.rm = TRUE)
  (sum(x / totalCov * log(x / totalCov), na.rm = TRUE)) * -1
}

# for this particular example, the only two functions (metrics) that will work are RICH and shannon
nrep <- 1000
totalQuads <- nrow(d)
bootResultSD <- data.frame(matrix(nrow = nrep, ncol = totalQuads))
bootResultMean <- data.frame(matrix(nrow = nrep, ncol = totalQuads))

for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, j, replace = FALSE)
    bootSample <- d[bootIndex, , drop = FALSE]
    VALUES <- apply(bootSample, 1, shannon)
    bootResultSD[i, j] <- sd(VALUES, na.rm = TRUE)
    bootResultMean[i, j] <- mean(VALUES, na.rm = TRUE)
  }
}

VALUES
bootResultSD
bootResultMean
meanDATA <- apply(bootResultMean, 2, mean, na.rm = TRUE)
meanDATASD <- apply(bootResultSD[-1], 2, mean, na.rm = TRUE)
The issue with what I've created is that it calculates metrics on a per-plot basis, instead of accumulating plots and re-calculating the metrics on each cumulative sample.
Here is what I’ve come up with so far based off my code from above, but I don’t think this is what I need:
for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, 10, replace = TRUE)
    bootSample <- d[bootIndex, , drop = FALSE]
    booted <- bootSample[!duplicated(bootSample[, 2]), ]
    bootResultSD[i, j] <- sd(booted, na.rm = TRUE)
    bootResultMean[i, j] <- mean(booted, na.rm = TRUE)
  }
}
I’m at a loss for how to proceed past this point. Thanks in advance!
UPDATE:
I worked with a colleague to develop an answer to my question above. Here is the code that he created.
# d <- data file
nrep <- 1000
totalQuads <- length(unique(d$Plot))

boot.result.fqi     <- matrix(nrow = nrep, ncol = totalQuads)
boot.result.meanc   <- matrix(nrow = nrep, ncol = totalQuads)
boot.result.shannon <- matrix(nrow = nrep, ncol = totalQuads)
boot.result.rich    <- matrix(nrow = nrep, ncol = totalQuads)

for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, j, replace = TRUE)
    for (k in 1:length(bootIndex)) {
      if (k == 1) {
        bootSample <- subset(d, Plot %in% bootIndex[k])
      } else {
        bootSample <- rbind(bootSample, subset(d, Plot %in% bootIndex[k]))
      }
    }
    # bootSample <- subset(d, Plot %in% bootIndex)
    bootSampleUniqSp <- unique(bootSample[c("Species", "C_Value")])

    ## Calculate and store the results
    # Richness
    boot.result.rich[i, j] <- nrow(bootSampleUniqSp)
    # Mean C
    if (boot.result.rich[i, j] > 0) {
      boot.result.meanc[i, j] <- mean(bootSampleUniqSp$C_Value)
    } else {
      boot.result.meanc[i, j] <- 0
      # This is the rule I've set up for when there are no species in the quads. Change the rule as you like.
    }
    # FQI
    boot.result.fqi[i, j] <- boot.result.meanc[i, j] * sqrt(boot.result.rich[i, j])
    # Shannon
    covers <- aggregate(bootSample$CoverA_1.4mplot,
                        by = list(bootSample$Species), sum)
    # covers <- bootSample$CoverA_1.4mplot
    total.cov <- sum(covers$x)
    boot.result.shannon[i, j] <- (sum(covers$x / total.cov *
                                        log(covers$x / total.cov), na.rm = TRUE)) * -1
  }
}

par(mfcol = c(2, 2))
boxplot(boot.result.rich,    main = "Richness")
boxplot(boot.result.meanc,   main = "Mean C")
boxplot(boot.result.fqi,     main = "FQI")
boxplot(boot.result.shannon, main = "Shannon's index")

# the means across the number of quadrats
apply(boot.result.shannon, 2, mean)

summary.dfr <- data.frame(quads    = 1:totalQuads,
                          richness = apply(boot.result.rich, 2, mean),
                          meanc    = apply(boot.result.meanc, 2, mean),
                          fqi      = apply(boot.result.fqi, 2, mean),
                          shannon  = apply(boot.result.shannon, 2, mean))
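One small possible simplification (my addition, not from the colleague's code): the inner k loop that grows bootSample with rbind() can be collapsed into a single bind. This is behaviour-equivalent, keeps duplicated plot draws (unlike the commented-out %in% one-liner), and avoids copying the growing data frame on every pass:
# one subset per sampled plot, bound in a single rbind()
bootSample <- do.call(rbind, lapply(bootIndex, function(p) subset(d, Plot %in% p)))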

how to make a fast pairwise Tanimoto distance function in R

I have a data.frame of items identified by an integer property ID, which is also the row number of the data.frame.
Each item has a vector of features FP associated to it. The elements of each FP are unique (within that FP). So for instance c(1,2,7) but never c(1,7,7).
The Tanimoto distance between any two ID's is defined as 1 minus the number of unique elements in the intersection of their FP's, divided by the number of unique elements in the union of their FP's.
I need to calculate such distances in the context of a 'maxmin' algorithm. See for instance this blog post.
The most important point to note is that I must NOT compute a full distance matrix (even with the best algorithms it would be unfeasible on the scale of datasets I am working with).
As explained in the above post, the strength of the iterative maxmin picker according to Roger Sayle's method is that one can avoid computing most of the pairwise distances, and instead calculate only the few relevant ones. Hence my question.
Here's what I could come up with so far:
# make a random dataset
set.seed(1234567)
d <- sample(30:45, 1000, replace = TRUE)
dd <- setNames(
  data.frame(do.call(rbind, sapply(d, function(n) list(sample(as.character(1:(45 * 2)), n, replace = FALSE)),
                                   simplify = FALSE))),
  "FP")
dd["ID"] <- 1:NROW(dd)

# define a pairwise distance function for IDs
distfun <- function(ID1, ID2) {
  FP1 <- dd$FP[[ID1]]
  FP2 <- dd$FP[[ID2]]
  int <- length(intersect(FP1, FP2))
  1 - int / (d[ID1] + d[ID2] - int)
}

# test the performance of the distance function
x <- sample(dd$ID, 200, replace = FALSE)
y <- sample(dd$ID[!(dd$ID %in% x)], 200, replace = FALSE)
pairwise.dist <- NULL
system.time(
  for (i in x) {
    for (j in y) {
      dij <- distfun(i, j)
      # pairwise.dist <- rbind(pairwise.dist, c(min(i, j), max(i, j), dij))
    }
  }
)
#  user  system elapsed
#  0.86    0.00    0.86
Question 1: do you think the distance function could be made faster?
I tried making a sparse matrix of the features (ddu.tab in the code below, where I omitted the denominator, which is trivial to compute from the intersection) and defining the distance function as vector operations, but that was much slower (a bit to my surprise, I must say).
ddu <- do.call(rbind, sapply(dd$ID, function(x) {
  data.frame("ID" = x, "FP" = dd$FP[[x]], stringsAsFactors = FALSE)
}, simplify = FALSE))
ddu.tab <- xtabs(~ ID + FP, ddu, sparse = TRUE)
system.time(
  for (i in x) {
    for (j in y) {
      dij <- t(ddu.tab[i, ]) %*% ddu.tab[j, ]
      # pairwise.dist <- rbind(pairwise.dist, c(min(i, j), max(i, j), dij))
    }
  }
)
#  user  system elapsed
# 32.35    0.03   32.66
Question 2: actually less important than the distance calculation, but if anyone can advise... The update of pairwise.dist by rbind is (apparently) very costly. I don't know if I can do it differently (i.e. without adding new elements at each iteration), because in the maxmin application the pairs of IDs whose distances are to be calculated are not known upfront, unlike in this example, and pairwise.dist is continuously read and appended to.
Someone once suggested to me that lists may be better than matrices for reading and writing. If that is the case, I could store pairwise.dist as a named list.
BTW, just FYI, in this specific example the full distance matrix is calculated quite fast:
system.time(ddu.dist <- dist(ddu.tab, method = "binary"))
#  user  system elapsed
#  0.61    0.00    0.61
which seems to indicate that there is indeed a fast method to calculate binary distances.
If anyone could please advise and/or point me to relevant resources, it would be great.
Thanks!
Not sure about speeding up the distance function itself, but you could replace your double loop, using the tidyverse, with
library(tidyverse)
results <- crossing(x = x, y = y) %>%                # all x, y combinations
  filter(x < y) %>%                                  # remove duplicates
  mutate(pairwise.dist = map2_dbl(x, y, distfun))    # apply the distance function
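On Question 2, which the answer above does not address: a common pattern in R (a general suggestion, not specific to the maxmin algorithm) is to collect each result in a list and bind once at the end, because rbind() inside a loop copies the whole matrix on every iteration. A rough sketch using the example's double loop; in the real maxmin setting, where the pairs are not known upfront, the same grow-a-list idea still applies:
# pre-allocate a list with a generous upper bound (every x-y pair here), fill it as you go,
# then bind once at the end
acc <- vector("list", length(x) * length(y))
n <- 0
for (i in x) {
  for (j in y) {
    n <- n + 1
    acc[[n]] <- c(min(i, j), max(i, j), distfun(i, j))
  }
}
pairwise.dist <- do.call(rbind, acc[seq_len(n)])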

Generate numbers with specific correlation [with only positive values in the output]

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function(n, xmean, xsd, ymean, ysd, correlation) {
  x <- rnorm(n)
  y <- rnorm(n)
  z <- correlation * scale(x)[, 1] +
    sqrt(1 - correlation^2) * scale(resid(lm(y ~ x)))[, 1]
  xresult <- xmean + xsd * scale(x)[, 1]
  yresult <- ymean + ysd * z
  data.frame(x = xresult, y = yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a method able to exclude from the final output all the rows which have at least one negative value? (In other terms, x and y must always be positive.)
I have spent many hours without any concrete result.
Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- COR(100, 1, 1, 1, 1, 0.8)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter, ]
plot(df)
First, I create a data frame df from your function. Then I apply the function sum(x < 0) > 0 row-wise to the data frame (the second argument of apply, 1, means "go along the first dimension", i.e. rows). This creates a logical vector that is TRUE for every row with at least one negative value. Subsetting the data frame with its inverse (!filter) leaves you with all rows that have no negative values.
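An equivalent base R one-liner (just an alternative phrasing of the same filter, not part of the original answer) uses rowSums() on the logical comparison:
# keep only the rows in which no column is negative
df <- df[rowSums(df < 0) == 0, ]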
UPDATE:
It seems the VineCopula package offers functions to create distributions with a given correlation. However, I have not dug into the maths deeply enough to fully grasp how copulas (i.e. multivariate probability distributions) work. Using this package, you can at least create, e.g., two Gaussian distributions:
library(VineCopula)
BC <- BiCop(family = 1, par = 0.9)
sim <- BiCopSim(N = 1000, obj = BC)
cor(sim[,1], sim[,2])
plot(sim)
You might then be able to scale the resulting matrix to achieve a certain standard deviation.
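One further direction (my own addition, not from the original answer): BiCopSim() returns uniform marginals on (0, 1), so applying the quantile function of a positive-support distribution keeps every value strictly positive while preserving the copula's dependence structure; the resulting Pearson correlation will differ somewhat from the copula parameter. A minimal sketch with hypothetical lognormal marginals:
library(VineCopula)

set.seed(1)
BC  <- BiCop(family = 1, par = 0.9)   # Gaussian copula, as above
sim <- BiCopSim(N = 1000, obj = BC)   # uniform marginals on (0, 1)

# transform each uniform margin with a positive-support quantile function
x <- qlnorm(sim[, 1], meanlog = 0, sdlog = 0.5)
y <- qlnorm(sim[, 2], meanlog = 1, sdlog = 0.3)

all(x > 0 & y > 0)   # TRUE by construction
cor(x, y)            # close to, but not exactly, the copula parameter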
