Speed optimization - calculating weighted column in data.table with distance matrix - r

I am trying to apply weights to a numeric vector in a data.table. The weights come from the Euclidean distances between each point and all the other points. If two points are close to each other, the weight assigned to that pair is higher; if the distance between two points is greater than a threshold, the weight is 0; and the weight assigned to the distance between a point and itself is 10000.
I can illustrate with the code below:
library(data.table)
library(dplyr)
library(tictoc)

set.seed(42)
df <- data.table(
  LAT = rnorm(500, 42),
  LONG = rnorm(500, -72),
  points = rnorm(500)
)
df2 <- copy(df) # for the new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute the distance matrix

# function to calculate the weighted value for one row
func <- function(j, cols, threshold) {
  N <- which(d[j, ] <= threshold)  # points whose distances are below the threshold
  K <- 1 / (d[j, N]^2)             # weights are inversely proportional to distance: lower distance, higher weight
  K[which(d[j, N] == 0)] <- 10000  # the weight of a point to itself is 10000
  return((K %*% as.matrix(df[N, ..cols])) / sum(K)) # compute the weighted point for one row
}

tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result] # assign the results back to the data.table
toc()
The current function works well for a small number of points, but it takes far too long to compute weighted points for about 220K rows.
I have come up with another solution that cuts the time in half, but I think it can still be improved.
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)

tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
  x == 0   ~ 10000,
  x <= 0.5 ~ 1 / (x^2),
  TRUE     ~ 0)), .SDcols = cols]
# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
  set(df2, i, 'weighted_points',
      value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]]))
}
toc()

round(sum(df$weighted_points - df2$weighted_points), 0)
The time difference may be small for this small data set, but I have timed both approaches on the real data set and the new way is quite a bit faster.
My question is: how can I make the new approach even faster? I know I am using case_when from dplyr, which may trade speed for readability, but are there other things I am not doing correctly in data.table that could help make it faster?
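For reference, here is a sketch of one direction I have been experimenting with (untested at full scale, and at 220K rows the dense distance matrix itself becomes the limiting factor): build the weight matrix directly from d with plain vectorized arithmetic instead of case_when(), and collapse the per-row loop into a single matrix product.
# sketch only: assumes d and df2 from the code above and a threshold of 0.5
W <- 1 / d^2            # inverse-squared-distance weights (Inf on the diagonal for now)
W[d > 0.5] <- 0         # zero weight beyond the threshold
W[d == 0]  <- 10000     # weight of a point to itself (overwrites the Inf diagonal)
df2[, weighted_points := as.vector(W %*% points) / rowSums(W)]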

From a data-analyst point of view, I think you could improve your code with an approximation of what counts as a close point.
I once worked with NCDC station locations and tried to find the closest stations to each other; because there were so many stations, it was very time-consuming. I came up with the idea that, after computing dist() on the coordinates, I would simply rank the distances for each point and set a threshold on how many stations to use in the real weight calculation.
For example, after ranking, take the 50 closest points (by rank) and assign them their weights; every other point just gets a weight of 0.
Hope this helps.
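A rough sketch of that rank-based cutoff (not part of the original answer), reusing d and df from the question, with an assumed cutoff of k = 50 nearest points per row:
k <- 50  # assumed number of nearest points to keep per row
weighted_knn <- function(j) {
  keep <- order(d[j, ])[seq_len(k)]   # the k closest points; rank 1 is the point itself
  w <- 1 / d[j, keep]^2
  w[d[j, keep] == 0] <- 10000         # weight of a point to itself
  sum(w * df$points[keep]) / sum(w)
}
approx_weighted <- vapply(seq_len(nrow(df)), weighted_knn, numeric(1))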

Related

Accumulation curves in R (not in Vegan)

I’d like to create accumulation curves, specifically metric accumulation curves, using bootstrapping and for loops. I’m interested in sampling (with replacement) the total number of plots in my example dataset, starting at 1 and working up to the total number (n=1 … max n). Each will be sampled 1000 times.
I don’t believe a package, such as Vegan, will help with this, since I’m not looking for species accumulation curves but instead need to calculate metrics based on abundance data and a plant species’ coefficient of conservatism (please correct me if I’m wrong about this!).
My example dataset is a matrix, with plots, plant species names, abundance values, and c-values (coefficients of conservatism):
https://docs.google.com/spreadsheets/d/1v-93sV4ANUXpObVbtixTo2ZQjiOKemvQ_cfubZPq9L4/edit?usp=sharing
For each of the 1000 iterations at each nth sample size, I need to build a matrix that holds that iteration's results (species name, abundance, and c-value) and then eliminate any duplicate species from that sample. For each iteration, I must then calculate the vegetation metrics. It's important that I don't calculate the metric across all 1000 iterations at once, but for each individual iteration.
I will repeat for n+1 until max n. At the end, ideally, I will then input those results into a matrix of my final results, with rows being n … max n, and 1000 columns with calculated metrics for each of those 1000 iterations. I will then average across iterations, and then create an accumulation curve of my desired metric from those averages.
The code that I thought was useful is included below, with a different example data set, including the metrics that I’m interested in calculating.
https://docs.google.com/spreadsheets/d/1GcH2aq3qZzgTv2YkN-uMpnShblgsuKxAPYKH_mLbbh8/edit?usp=sharing
d <- Example2
d <- data.matrix(d)

MEANC <- function(x) {
  mean(x, na.rm=TRUE)
}
FQI <- function(x) {
  mean(x, na.rm=TRUE) * sqrt(sum(!is.na(x)))
}
RICH <- function(x) {
  totalsprich <- sum(x)
  sum(x != 0, na.rm=TRUE)
}
shannon <- function(x) {
  totalCov <- sum(x, na.rm=TRUE)
  (sum(x / totalCov * log(x / totalCov), na.rm=TRUE)) * -1
}
# for this particular example, the only two functions (metrics) that will work are RICH and shannon
nrep <- 1000
totalQuads <- nrow(d)
bootResultSD <- data.frame(matrix(nrow=nrep, ncol=totalQuads))
bootResultMean <- data.frame(matrix(nrow=nrep, ncol=totalQuads))

for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, j, replace=FALSE)
    bootSample <- d[bootIndex, , drop=FALSE]
    VALUES <- apply(bootSample, 1, shannon)
    bootResultSD[i, j] <- sd(VALUES, na.rm=TRUE)
    bootResultMean[i, j] <- mean(VALUES, na.rm=TRUE)
  }
}

VALUES
bootResultSD
bootResultMean

meanDATA <- apply(bootResultMean, 2, mean, na.rm=TRUE)
meanDATASD <- apply(bootResultSD[-1], 2, mean, na.rm=TRUE)
The issue with what I've created is that it calculates metrics on a per-plot basis instead of accumulating plots and recalculating the metrics on each cumulative sample.
Here is what I've come up with so far, based on my code from above, but I don't think this is what I need:
for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, 10, replace=TRUE)
    bootSample <- d[bootIndex, , drop=FALSE]
    booted <- bootSample[!duplicated(bootSample[, 2]), ]
    bootResultSD[i, j] <- sd(booted, na.rm=TRUE)
    bootResultMean[i, j] <- mean(booted, na.rm=TRUE)
  }
}
I’m at a loss for how to proceed past this point. Thanks in advance!
UPDATE:
I worked with a colleague to develop an answer to my question above. Here is the code that he created.
# d <- data file
nrep <- 1000
totalQuads <- length(unique(d$Plot))
boot.result.fqi     <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.meanc   <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.shannon <- matrix(nrow=nrep, ncol=totalQuads)
boot.result.rich    <- matrix(nrow=nrep, ncol=totalQuads)

for (j in 1:totalQuads) {
  for (i in 1:nrep) {
    bootIndex <- sample(1:totalQuads, j, replace=TRUE)
    for (k in 1:length(bootIndex)) {
      if (k == 1) {
        bootSample <- subset(d, Plot %in% bootIndex[k])
      } else {
        bootSample <- rbind(bootSample, subset(d, Plot %in% bootIndex[k]))
      }
    }
    # bootSample <- subset(d, Plot %in% bootIndex)
    bootSampleUniqSp <- unique(bootSample[c("Species", "C_Value")])

    ## Calculate and store the results
    # Richness
    boot.result.rich[i, j] <- nrow(bootSampleUniqSp)
    # Mean C
    if (boot.result.rich[i, j] > 0) {
      boot.result.meanc[i, j] <- mean(bootSampleUniqSp$C_Value)
    } else {
      boot.result.meanc[i, j] <- 0
      # This is the rule I've set up for when there are no species in the quads. Change the rule as you like.
    }
    # FQI
    boot.result.fqi[i, j] <- boot.result.meanc[i, j] * sqrt(boot.result.rich[i, j])
    # Shannon
    covers <- aggregate(bootSample$CoverA_1.4mplot,
                        by=list(bootSample$Species), sum)
    # covers <- bootSample$CoverA_1.4mplot
    total.cov <- sum(covers$x)
    boot.result.shannon[i, j] <- (sum(covers$x / total.cov *
                                        log(covers$x / total.cov), na.rm=TRUE)) * -1
  }
}

par(mfcol=c(2, 2))
boxplot(boot.result.rich, main="Richness")
boxplot(boot.result.meanc, main="Mean C")
boxplot(boot.result.fqi, main="FQI")
boxplot(boot.result.shannon, main="Shannon's index")

# the means across number of quadrats
apply(boot.result.shannon, 2, mean)

summary.dfr <- data.frame(quads=1:totalQuads,
                          richness=apply(boot.result.rich, 2, mean),
                          meanc=apply(boot.result.meanc, 2, mean),
                          fqi=apply(boot.result.fqi, 2, mean),
                          shannon=apply(boot.result.shannon, 2, mean))
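From summary.dfr, the accumulation curve itself can then be drawn, for example for the Shannon index (a short sketch; the labels are just placeholders):
# accumulation curve of the mean Shannon index versus number of quadrats
plot(summary.dfr$quads, summary.dfr$shannon, type = "b",
     xlab = "Number of quadrats sampled", ylab = "Mean Shannon index",
     main = "Metric accumulation curve")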

Find adjacent rows that match condition

I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).
How do I find the probability of 2 adjacent rows matching a condition?
For example, I want to know the probability of 2 consecutive days having a higher-than-mean/median value. I know I can lag the previous day's value into the next row, which would let me compute this statistic, but that seems very cumbersome and inflexible.
Is there a better way to get this done?
xts sample data:
foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))
What's the probability of 2 consecutive days having a higher than median value?
You can create a new column that flags which values are higher than the median, and then take only those that are both consecutive and higher.
> foo <- as_tibble(data.table(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days")))
Step 1
Create column to find those that are higher than median
> foo$higher_than_median <- foo$x > median(foo$x)
Step 2
Compare that column using diff:
Take it only when both values are consecutively higher or lower: c(0, diff(foo$higher_than_median)) == 0
Then add the condition that they must both be higher: foo$higher_than_median == TRUE
Full expression:
foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE
Step 3
To find probability take the mean of foo$both_higher
mean(foo$both_higher)
[1] 0.1428571
Here is a pure xts solution.
How do you define the median? There are several ways.
In an online time-series setting, like computing a moving average, you can compute the median over a fixed lookback window (shown below) or from the origin up to now (an anchored window calculation). Either way, you won't use future values beyond the current time step in the median computation, which avoids look-ahead bias:
library(xts)
library(TTR)
x <- rep(c(1,1,5,1,5,5,1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"), dimnames = list(NULL, "x"))
# Avoid look ahead bias in an online time series application by computing the median over a rolling fixed time window:
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5
nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)
z <- na.omit(y)
prob <- sum(z[,"runSum2"] >= nSum) / NROW(z)
The case where your median is over the entire data set is obviously a much easier modification of this.
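For completeness, a minimal sketch of that easier case, where the median is taken over the entire series (this uses the full sample, so it is only appropriate offline):
y$isAboveMedFull <- y[, "x"] > median(as.numeric(y[, "x"]))   # full-sample median
runsFull <- runSum(y$isAboveMedFull, n = nSum)
probFull <- sum(runsFull >= nSum, na.rm = TRUE) / NROW(na.omit(runsFull))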

Faster way to generate large list of vectors from permuted datasets [R]

Setup: For the purposes of my simulation, I'm generating a list of B = 2000 elements. Each element is the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and then, for each column, calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question: This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code
B <- 2000
n.row <- 200; n.col <- 8000

# Generate sample data
samp.dat <- matrix(rnorm(n.row * n.col), nrow = n.row)

perm.KS.list <- NULL
for (b in 1:B) {
  # permute the rows
  perm.dat.tmp <- samp.dat[sample(nrow(samp.dat)), ]
  # Compute the permutation-based test statistics
  perm.KS.list[[b]] <- apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}

# Modified KS-test statistic (adapted from the base stats package)
ks.test.stat <- function(x, y) {
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]  # exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
The 1:B loop has several places to optimize, but I agree that the real time consumer is that inner function. Because you're simulating well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values, which obviates the is.na() adjustments.
The two sides (i.e., x and y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep them together. The balanced sides also permit simplifications like removing the ifelse() clause: it produces a bunch of 0/1s, which are rescaled to -1/1s with integer arithmetic.
The reduced function below saves about 25% of the time. I also add integers, instead of doubles, inside cumsum().
ks.test.stat.balanced <- function(w) {
  n <- as.integer(length(w) * .5)
  # z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z <- cumsum((order(w) <= n) * 2L - 1L) / n
  # z <- z[c(which(diff(sort(w)) != 0), n + n)]  # exclude ties
  return(max(abs(z)))
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
Edit: in the case of an unbalanced simulation, I'd recommend you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment with precalculating 1/n before cumsum() to avoid a lot of expensive divisions, and try to think of other mathematical ways to pull calculations out of an inner loop so they run less frequently; a sketch of the unbalanced variant follows.
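A minimal sketch of such an unbalanced variant, under the same assumptions (no NAs, ties ignored), where n1 and n2 are the two group sizes and w holds group 1 stacked before group 2:
ks.test.stat.unbalanced <- function(w, n1, n2) {
  inv1 <- 1 / n1                 # precalculate the reciprocals once, so each
  inv2 <- 1 / n2                 # per-element step is a multiply, not a divide
  up <- order(w) <= n1           # TRUE where the ordered value comes from group 1
  max(abs(cumsum(up * inv1 - (!up) * inv2)))
}
# usage: ks.test.stat.unbalanced(c(x, y), length(x), length(y))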

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I have already come across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but they only give me very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so that I could simply transform it (for example by adding 50) and get a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of more or less meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
On its own, this results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values beyond that range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
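Applied to the var2 produced by the question's code, that rescaling would look like this (a sketch; the correlation is unchanged by a linear transformation):
var2_scaled <- var2 * 100000 + 50000   # from roughly (-0.5, 0.5) to roughly (0, 100000)
cor(var1, var2_scaled)
range(var2_scaled)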

Select the most dissimilar individual using cluster analysis [duplicate]

I want to cluster my data into, say, 5 clusters and then select the 50 most dissimilar individuals from the whole data set, taking them proportionally from the clusters. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I have done until now is cluster the data and rank the individuals within each cluster, then export it to Excel and go from there…
That has become a problem since my data set has become really big.
I would appreciate any help or suggestions on how to do this in R.
I'm not sure if it is exactly what you are searching for, but maybe it helps:
mydata <- matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification

# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
  sim <- dist(mydata, diag = TRUE, upper = TRUE)), 5)

# combine results, take sum to aggregate dissimilarity
res <- data.frame(id = rownames(mydata),
                  cluster = cl, dis_sim = rowSums(as.matrix(sim)))

# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]

# split object
reslist <- split(res, f = res$cluster)

## takes the three items with the highest overall dissimilarity per cluster
lapply(reslist, tail, n = 3)

## returns ids with the highest overall dissimilarity, top 20% per cluster
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
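To get the fixed total of 50 individuals split proportionally across clusters, as described in the question, one possible sketch (rounding may make the total differ slightly from 50):
n_total <- 50
quota <- round(n_total * table(cl) / length(cl))   # per-cluster share of the 50
# take the most dissimilar `quota` individuals from each cluster
picked <- Map(function(x, k) tail(x, k), reslist, as.integer(quota))
picked_ids <- unlist(lapply(picked, function(x) as.character(x$id)))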
Regarding your comment, find the code below.
Please note that the code can be improved in terms of beauty and efficiency.
Furthermore, I am using a second answer because otherwise it would get too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])

# combine cluster number and data, and split by cluster
newdf <- data.frame(data = mydata, cluster = cutree(cl, k = 5))
newdfl <- split(newdf, f = newdf$cluster)

# add centroids and drop the cluster info
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][, -11]),
                  li = newdfl, cen = cent)

# calculate the new distances to the centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag = TRUE, upper = TRUE))[1, ]))
dist_to_cent
For the calculation of centroids out of hclust, see the R mailing list post linked above.
