Accumulation curves in R (not in vegan)

I’d like to create accumulation curves, specifically metric accumulation curves, using bootstrapping and for loops. I’m interested in sampling (with replacement) the plots in my example dataset, starting with 1 plot and working up to the total number of plots (n = 1 … max n). Each sample size will be drawn 1000 times.
I don’t believe a package such as vegan will help with this, since I’m not looking for species accumulation curves but instead need to calculate metrics based on abundance data and each plant species’ coefficient of conservatism (please correct me if I’m wrong about this!).
My example dataset is a matrix, with plots, plant species names, abundance values, and c-values (coefficients of conservatism):
https://docs.google.com/spreadsheets/d/1v-93sV4ANUXpObVbtixTo2ZQjiOKemvQ_cfubZPq9L4/edit?usp=sharing
For each of the 1000 iterations at each sample size n, I need to build a matrix holding that iteration’s draw (species name, abundance, and c-value) and then eliminate any duplicate species from that sample. For each iteration I must then calculate vegetation metrics. It’s important that I don’t calculate the metric across all 1000 iterations pooled together, but for each individual iteration.
I will repeat for n+1 until max n. At the end, ideally, I will input those results into a matrix of final results, with rows being n … max n and 1000 columns holding the calculated metric for each of those 1000 iterations. I will then average across iterations and create an accumulation curve of my desired metric from those averages.
The code that I thought was useful is included below, with a different example data set, including the metrics that I’m interested in calculating.
https://docs.google.com/spreadsheets/d/1GcH2aq3qZzgTv2YkN-uMpnShblgsuKxAPYKH_mLbbh8/edit?usp=sharing
d <- Example2
d <- data.matrix(d)

# Mean C: mean coefficient of conservatism
MEANC <- function(x){
  mean(x, na.rm=TRUE)
}

# FQI: Mean C * sqrt(species richness)
FQI <- function(x){
  mean(x, na.rm=TRUE) * sqrt(sum(!is.na(x)))
}

# Richness: number of non-zero entries
RICH <- function(x){
  totalsprich <- sum(x)   # computed but not used below
  sum(x != 0, na.rm=TRUE)
}

# Shannon diversity from cover values
shannon <- function(x){
  totalCov <- sum(x, na.rm=TRUE)
  (sum(x / totalCov * log(x / totalCov), na.rm=TRUE)) * -1
}
# For this particular example, the only two functions (metrics) that will work are RICH and shannon
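# Quick sanity check of the metric functions on hypothetical toy vectors
# (values chosen purely for illustration, not from the example data):
cvals  <- c(3, 5, NA, 7, 5)  # toy coefficients of conservatism
covers <- c(2, 0, 3, 1)      # toy cover/abundance values
MEANC(cvals)     # 5
FQI(cvals)       # 5 * sqrt(4) = 10
RICH(covers)     # 3 non-zero entries
shannon(covers)  # about 1.01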
nrep <- 1000
totalQuads <- nrow(d)
bootResultSD   <- data.frame(matrix(nrow=nrep, ncol=totalQuads))
bootResultMean <- data.frame(matrix(nrow=nrep, ncol=totalQuads))

for(j in 1:totalQuads){
  for(i in 1:nrep){
    bootIndex  <- sample(1:totalQuads, j, replace=FALSE)
    bootSample <- d[bootIndex, , drop=FALSE]   # keep matrix structure even when j = 1
    VALUES <- apply(bootSample, 1, shannon)    # one metric value per plot (row)
    bootResultSD[i, j]   <- sd(VALUES, na.rm=TRUE)
    bootResultMean[i, j] <- mean(VALUES, na.rm=TRUE)
  }
}

VALUES
bootResultSD
bootResultMean

meanDATA   <- apply(bootResultMean, 2, mean, na.rm=TRUE)
meanDATASD <- apply(bootResultSD[-1], 2, mean, na.rm=TRUE)  # drops column 1 (sd of a single plot is NA)
The issue with the code above is that it calculates the metrics on a per-plot basis, instead of accumulating plots and re-calculating the metrics on each cumulative sample.
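To illustrate the difference on a hypothetical toy matrix m (plots in rows, species covers in columns), compare the per-plot and pooled calculations:
m <- matrix(c(2, 0, 3,
              1, 4, 0), nrow=2, byrow=TRUE)
apply(m, 1, shannon)   # one Shannon value per plot
shannon(colSums(m))    # one Shannon value for the pooled (cumulative) sample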
Here is what I’ve come up with so far, based on my code above, but I don’t think this is what I need:
for(j in 1:totalQuads){
  for(i in 1:nrep){
    bootIndex  <- sample(1:totalQuads, 10, replace=TRUE)  # note: always draws 10 plots here
    bootSample <- d[bootIndex, , drop=FALSE]
    booted     <- bootSample[!duplicated(bootSample[,2]), ]  # drop duplicate species (column 2)
    bootResultSD[i, j]   <- sd(booted, na.rm=TRUE)
    bootResultMean[i, j] <- mean(booted, na.rm=TRUE)
  }
}
I’m at a loss for how to proceed past this point. Thanks in advance!

UPDATE:
I worked with a colleague to develop an answer to my question above. Here is the code that he created.
# d <- data file (long format: one row per species record per plot,
#                 with columns Plot, Species, C_Value, CoverA_1.4mplot)
nrep <- 1000
totalQuads <- length(unique(d$Plot))

boot.result.fqi     <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.meanc   <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.shannon <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.rich    <- matrix(nrow=nrep, ncol=totalQuads)

for(j in 1:totalQuads){
  for(i in 1:nrep){
    bootIndex <- sample(1:totalQuads, j, replace=TRUE)
    # stack the drawn plots, keeping repeats from sampling with replacement
    for(k in 1:length(bootIndex)){
      if(k == 1){
        bootSample <- subset(d, Plot %in% bootIndex[k])
      } else {
        bootSample <- rbind(bootSample, subset(d, Plot %in% bootIndex[k]))
      }
    }
    # bootSample <- subset(d, Plot %in% bootIndex)
    bootSampleUniqSp <- unique(bootSample[c("Species", "C_Value")])

    ## Calculate and store the results
    # Richness
    boot.result.rich[i,j] <- nrow(bootSampleUniqSp)
    # Mean C
    if(boot.result.rich[i,j] > 0){
      boot.result.meanc[i,j] <- mean(bootSampleUniqSp$C_Value)
    } else {
      boot.result.meanc[i,j] <- 0
      # This is the rule I've set up for when there are no species in the quads.
      # Change the rule as you like.
    }
    # FQI
    boot.result.fqi[i,j] <- boot.result.meanc[i,j] * sqrt(boot.result.rich[i,j])
    # Shannon (pool cover by species across the accumulated plots)
    covers <- aggregate(bootSample$CoverA_1.4mplot,
                        by=list(bootSample$Species), sum)
    # covers <- bootSample$CoverA_1.4mplot
    total.cov <- sum(covers$x)
    boot.result.shannon[i,j] <- (sum(covers$x / total.cov *
                                     log(covers$x / total.cov), na.rm=TRUE)) * -1
  }
}
par(mfcol=c(2,2))
boxplot(boot.result.rich,    main="Richness")
boxplot(boot.result.meanc,   main="Mean C")
boxplot(boot.result.fqi,     main="FQI")
boxplot(boot.result.shannon, main="Shannon's index")

# the means across number of quadrats
apply(boot.result.shannon, 2, mean)

summary.dfr <- data.frame(quads    = 1:totalQuads,
                          richness = apply(boot.result.rich, 2, mean),
                          meanc    = apply(boot.result.meanc, 2, mean),
                          fqi      = apply(boot.result.fqi, 2, mean),
                          shannon  = apply(boot.result.shannon, 2, mean))
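As a side note, the inner k loop that stacks the drawn plots could be replaced by a single line that still preserves repeated draws; this is only a sketch, assuming the same data frame d and Plot column as above:
bootSample <- do.call(rbind, lapply(bootIndex, function(p) subset(d, Plot == p)))
# each element of bootIndex contributes its plot's rows, so plots drawn more than
# once are kept more than once (unlike subset(d, Plot %in% bootIndex))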

Related

Vectorizing function to calculate entropy

I'm writing a function to calculate a Shannon diversity index. I have wide data with the percent of observations for each value as separate variables, and each row representing a different site. I will have between 2 and 7 variables depending on the data set. For each row I want to calculate the information index.
I have a loop function, but it's quite slow and I'm looking for help to vectorize it. I'm also happy with a tidyverse-style solution.
I've been looking at using the entropy package, but it seems to expect the data in long form, and while I could expand my data back out, that seems unnecessarily slow. I currently have 20k sites with hundreds to thousands of observations per site that have already been summarized into wide-format percent distributions. This question similarly works with long-form data.
Example data
# Wide data: between 2 and 7 columns recording the percent of observations
# with each value; example using 3
df <- data.frame(
  site = 1:3,
  l1 = c(.33, .5, 0),
  l2 = c(.33, .5, 0),
  l3 = c(.33, 0, 1)
)
Current loop function
entropy <- function(df, vars) {
  entropy_calc <- function(df, i, vars) {
    sum <- 0
    for (j in vars) {
      x <- df[i, j]
      if (x != 0) {  # skip zeros
        sum <- sum + x * log(x)
      }
    }
    return(-sum)
  }
  entropy <- rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    entropy[i] <- entropy_calc(df, i, vars)
  }
  return(as.numeric(entropy))
}
df$entropy <- entropy(df, 2:4)
This can be vectorized easily because the underlying functions needed are already vectorized. You don't need to manually skip zeroes because log(0) returns -Inf and 0*log(0) returns NaN. You can omit the NaN when summing the cell values by specifying na.rm = TRUE.
entropy <- function(p) rowSums(-(p * log(p)), na.rm = TRUE)
entropy(df[,2:4])
Also check out the diversity() function in the vegan package, which does essentially this, among other possibilities.
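For example (a minimal sketch; diversity() works row-wise on a matrix or data frame of counts or proportions, so results may differ marginally from the hand-rolled version when a row's percentages do not sum exactly to 1, because it normalizes by the row total):
library(vegan)
diversity(df[, 2:4], index = "shannon")  # one Shannon value per row (site)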

Speed optimization - calculating weighted column in data.table with distance matrix

I am trying to apply weights to a numeric vector in a data.table. The weights come from the euclidean distances of each point to all the other points. If a point is close to another point, the weights assigned to them will be higher; if the distance between two points is greater than a threshold, the weight will be 0. The weight assigned to the distance between a point and itself is 10000.
I can illustrate with the code below:
library(data.table)
library(dplyr)
library(tictoc)

set.seed(42)
df <- data.table(
  LAT    = rnorm(500, 42),
  LONG   = rnorm(500, -72),
  points = rnorm(500)
)
df2 <- copy(df)  # for new solution

d <- as.matrix(dist(df[, .(LAT, LONG)]))  # compute distance matrix

# function to calculate the weights
func <- function(j, cols, threshold) {
  N <- which(d[j, ] <= threshold)   # find points whose distances are below the threshold
  K <- 1 / (d[j, N] ^ 2)            # weights are inversely proportional to distance: lower distance, higher weight
  K[which(d[j, N] == 0)] <- 10000   # weight to itself is 10000
  return((K %*% as.matrix(df[N, ..cols])) / sum(K))  # compute weighted point for one row
}

tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result]  # assign the results back to the data.table
toc()
The current function works well for a small number of points, but it takes a lot longer to compute the weighted points for about 220K rows.
I have come up with another solution that cuts the time in half, but I think it can still be improved.
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)

tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
  x == 0   ~ 10000,
  x <= 0.5 ~ 1 / (x^2),
  TRUE     ~ 0)), .SDcols = cols]
# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
  set(df2, i, 'weighted_points',
      value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]]))
}
toc()

round(sum(df$weighted_points - df2$weighted_points), 0)
The time differences may be small for this small data set, but I have tested the time using the real data set and the new way is quite a bit faster.
My question is, how can I make the new approach to be even faster? I know I am using case_when from dplyr which could make things slower in exchange for readability, but are there other things that I am not doing correctly in data.table that could help make it faster?
From the data-analysis side, I think you could improve your code by approximating which points actually count as close instead of weighting against every other point.
I once worked with NCDC station locations and tried to find the closest stations to each other; because there were so many stations, it was time-consuming. The idea I came up with was: after computing dist on the coordinates, rank the distances for each point and set a threshold for how many stations to use in the real weight calculation.
For example, after ranking, take the 50 closest points (by rank) and assign them weights accordingly; all other points simply get a weight of 0.
Hope this helps.
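A minimal sketch of that idea, reusing df and the distance matrix d from the question and an assumed cutoff of k = 50 nearest neighbours:
k <- 50  # assumed number of nearest points to keep per row
w <- apply(d, 1, function(x) {
  keep <- order(x)[seq_len(k + 1)]  # the point itself plus its k nearest neighbours
  wt <- numeric(length(x))
  wt[keep] <- 1 / x[keep]^2         # inverse-squared-distance weights
  wt[x == 0] <- 10000               # weight of a point to itself
  wt
})
w <- t(w)                           # apply() returns one column per input row, so transpose
weighted_points <- as.vector(w %*% df$points) / rowSums(w)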

Generating two series with a certain correlation and a specific condition in R

I want to generate two data series of size 100 in R: one is remission time, tr, from an Exp(mean=1) distribution, and the other is survival time, t, from an Exp(mean=2.5) distribution. I want them to be negatively correlated (say, the correlation is -0.5). At the same time I want R to avoid values of t[i] that are less than tr[i] for data point i, because survival times should be greater than remission times. I have been able to produce some correlation between the two variables (although the correlation is not exactly reproduced) using the following code:
rho <- -0.5
mu <- rep(0, 2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2) * (1 - rho)

library(MASS)
rawvars <- mvrnorm(100, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)

tr <- rep(0, 100)
for (i in 1:100) {
  tr[i] <- qexp(pvars[,1][i], 1/1)
}

t <- rep(0, 100)
for (i in 1:100) {
  repeat {
    t[i] <- qexp(pvars[,2][i], 1/2)
    if (t[i] > tr[i]) break
  }
}

cor(tr, t)
sum(tr > t)  # shows number of invalid cases
But how should I efficiently impose the condition so that R only generates values of t that are greater than the corresponding tr?
Moreover, is there a better (faster) way to do the whole thing in R?
The issue here is that qexp is a quantile function and will return the same value for the same probability pvars[,2][i]. As a result, your code can easily go into an infinite loop whenever any one of the pvars[i,] pairs is such that t[i] <= tr[i]. To avoid that, you must regenerate your rawvars for each (t[i], tr[i]) pair that fails your condition. In addition, looping over pvars is not necessary, since qexp and the > operator are both vectorized. The following code does what you want:
rho <- -0.5
mu <- rep(0, 2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2) * (1 - rho)
library(MASS)
set.seed(1)  ## so that results are repeatable

compute.tr.t <- function(n, paccept) {
  n <- round(n / paccept)
  rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
  pvars <- pnorm(rawvars)
  tr <- qexp(pvars[,1], 1/1)
  t  <- qexp(pvars[,2], 1/2)
  keep <- which(t > tr)
  return(data.frame(t=t[keep], tr=tr[keep]))
}

n <- 10000  ## generating 10000 instead of 100; this can now be large
paccept <- 1
res <- data.frame()
while (n > 0) {
  new.res <- compute.tr.t(n, paccept)
  res <- rbind(res, new.res)
  paccept <- nrow(new.res) / n
  n <- n - nrow(res)
}
Notes:
The function compute.tr.t borrows a technique from rejection sampling. Its input arguments are the requested number of samples and the expected probability of acceptance. With this:
It generates n = round(n / paccept) exponential variates for both tr and t, as you do, to account for the probability of acceptance.
It only keeps those satisfying the condition t > tr.
What compute.tr.t returns may be fewer than the requested n samples. We can then use this information to compute how many more samples we need and what the updated expected probability of acceptance is.
We generate the samples satisfying our condition in a while loop. In this loop:
We call compute.tr.t with a requested number of samples and the expected acceptance rate. Initially, these are set to the total number of samples we want and 1, respectively.
The result of compute.tr.t is then appended to the result data frame res.
The updated probability of acceptance is simply the ratio of how many samples were returned to how many were requested.
The updated requested number of samples is simply how many more we need out of the total number we want.
We stop when the next requested number of samples is less than or equal to 0 (i.e., we have enough samples).
The resulting data frame may contain more than the total number of samples we want.
Running this code, we get:
print(cor(res$tr,res$t))
[1] -0.09128498
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 0
We note that the anticorrelation is significantly weaker than expected. This is due to your condition. If we remove this condition by modifying compute.tr.t as:
compute.tr.t <- function(n, paccept) {
  n <- round(n / paccept)
  rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
  pvars <- pnorm(rawvars)
  tr <- qexp(pvars[,1], 1/1)
  t  <- qexp(pvars[,2], 1/2)
  return(data.frame(t=t, tr=tr))
}
Then we get:
print(cor(res$tr,res$t))
##[1] -0.3814602
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 3676
The correlation is now much more reasonable, but the number of invalid cases is significant.

Select the most dissimilar individual using cluster analysis

I want to cluster my data into, say, 5 clusters, and then select the 50 individuals with the most dissimilar relationship to all the data. That means if cluster one contains 100 individuals, cluster two contains 200, three contains 400, four contains 200, and five contains 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done so far is cluster the data and rank the individuals within each cluster, then export it to Excel and go from there.
That has become a problem since my data has become really big.
I would appreciate any help or suggestions on how to do this in R.
I'm not sure if it is exactly what you are searching for, but maybe it helps:
mydata <- matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100)  # some id for identification

# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
  sim <- dist(mydata, diag = TRUE, upper = TRUE)), 5)

# combine results, take the row sum to aggregate dissimilarity
res <- data.frame(id = rownames(mydata),
                  cluster = cl,
                  dis_sim = rowSums(as.matrix(sim)))

# order, lowest overall dissimilarity first
res <- res[order(res$dis_sim), ]

# split by cluster
reslist <- split(res, f = res$cluster)

## take the three items with the highest overall dissimilarity in each cluster
lapply(reslist, tail, n = 3)

## return the ids with the highest overall dissimilarity, top 20% of each cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
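To pull exactly 50 individuals allocated in proportion to cluster size, as described in the question, one possible continuation of the code above (the allocation step is an assumption added here, not part of the original answer):
n_total <- 50
sizes   <- table(res$cluster)                   # individuals per cluster
n_per   <- round(n_total * sizes / sum(sizes))  # proportional allocation; rounding may shift the total slightly
picked  <- Map(function(x, n) tail(x, n), reslist, n_per)  # most dissimilar n from each cluster
do.call(rbind, picked)                          # the selected individuals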
Regarding your comment, find the code below.
Please note that the code can be improved in terms of elegance and efficiency. Further, I used a second answer because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])

# add the cluster number to the data and split by cluster
newdf <- data.frame(data = mydata, cluster = cutree(cl, k = 5))
newdfl <- split(newdf, f = newdf$cluster)

# add centroids and drop the cluster info
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][, -11]),
                  li = newdfl, cen = cent)

# calculate distances to the centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag = TRUE, upper = TRUE))[1, ]))
dist_to_cent
For the calculation of centroids from hclust, see the R mailing list thread linked in the comment above.
