R: Comparison between two hexbins by applying KL divergence

Suppose I have two data sets of different sizes; each contains x and y coordinates that locate every observation.
set.seed(1)
x1 <- runif(1000,-195.5,195.5)
y1 <- runif(1000,-49,49)
data1 <- data.frame(x1,y1)
x2 <- runif(2000,-195.5,195.5)
y2 <- runif(2000,-49,49)
data2 <- data.frame(x2,y2)
Here I generated two data sets with random locations within a specific area.
Then I generated a hexbin object for each dataset. I know that to trace observations back to their bins, I need to set IDs = TRUE:
library(hexbin)
hbin_1 <- hexbin(x=data1$x1, y=data1$y1, xbins=30, shape=98/391, IDs=TRUE)
hbin_2 <- hexbin(x=data2$x2, y=data2$y2, xbins=30, shape=98/391, IDs=TRUE)
In the next step, I want to apply KL divergence to compare these two datasets. The problem is: how can I find the bin in the second dataset that matches a given bin in the first dataset? (I want to compare bins at the same location across the two datasets.)
UPDATE
We can get a table of cell names (bin numbers) with the corresponding count of observations in each bin by
tI1 <- table(hbin_1@cID)
tI2 <- table(hbin_2@cID)
The problem is that the bin numbers differ between dataset 1 and dataset 2. Even if we set the same xbins and shape in hexbin(), we still get different bins for the two datasets. How, then, can I compare the two datasets (or obtain bins at the same locations)?

The function hexbin does not return empty bins. Hence, even if we set xbins, xbnds and ybnds to the same values, the returned hexbin results can differ between the two datasets.
We can use kde2d from the package MASS to perform two-dimensional kernel density estimation on a common grid, so both datasets are evaluated over the same cells:
library(MASS)
xbnds <- c(-195.5, 195.5)
ybnds <- c(-49, 49)
b1 <- kde2d(data1$x1, data1$y1, lims = c(xbnds, ybnds))
b2 <- kde2d(data2$x2, data2$y2, lims = c(xbnds, ybnds))
Then we obtain a vector of kernel density estimates for each dataset. We normalise each vector by dividing by its sum, and finally apply KL divergence to quantify the similarity of the two distributions.
z1 <- as.vector(b1$z)
z2 <- as.vector(b2$z)
z1 <- z1 / sum(z1)  # normalise so each density vector sums to 1
z2 <- z2 / sum(z2)
kullback.leibler(z1, z2)
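Note that kullback.leibler() is not a base R function, so it is assumed here to come from an add-on package. If no such function is at hand, the divergence can also be computed directly from the two normalised vectors. A minimal sketch (the epsilon guard against log(0) is my own addition, not part of the original answer):
# Direct KL divergence between the two normalised density vectors.
# A small epsilon protects against log(0) in near-empty grid cells.
kl_div <- function(p, q, eps = 1e-12) {
  p <- (p + eps) / sum(p + eps)
  q <- (q + eps) / sum(q + eps)
  sum(p * log(p / q))
}
kl_div(z1, z2)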

Related

vegan::betadisper() extract distance and error associated with centroid

I am trying to construct a meta-regression to look at the distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same grouping variable with two levels: before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with the distances to each centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single standard deviation associated with each centroid.
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
rm(list=ls())
library(vegan)
library(dplyr)
# Species and environmental data
dune2.spe <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune2.env <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune)     # matrix with species data (20 samples in rows and 30 species in columns)
data(dune.env) # matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
# Select two grouping levels for 'Use'
dune_data <- cbind(dune2.spe, dune2.env) %>%
  filter(Use == 'Pasture' | Use == 'Hayfield')
dune_data$Use <- recode_factor(dune_data$Use, 'Pasture' = 'Before')
dune_data$Use <- recode_factor(dune_data$Use, 'Hayfield' = 'After')
dune_sp <- dune_data %>%
  dplyr::select(1:28)
dune_en <- dune_data %>%
  dplyr::select(29:33)
#transform relative species counts
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm=T)
(dune_disper <- betadisper(dune_distmat, type="centroid", group=dune_en$Use))
plot(dune_disper, label=FALSE)
I am trying to arrive at the following output:
Group  before_distance  n_before  sd_before  after_distance  n_after  sd_after
Dune   0.4009           5         ?          0.4314          7        ?
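betadisper() does not report a standard deviation directly, but the object it returns contains each sample's distance to its group centroid (dune_disper$distances) and the grouping factor (dune_disper$group), so a per-group n and sd can be summarised from those. A minimal sketch of how the target row might be assembled, assuming the dune_disper object from the code above (the column names simply mirror the desired output):
# Summarise per-group distance-to-centroid statistics from the betadisper object
d <- dune_disper$distances
g <- dune_disper$group
data.frame(
  Group           = "Dune",
  before_distance = mean(d[g == "Before"]),
  n_before        = sum(g == "Before"),
  sd_before       = sd(d[g == "Before"]),
  after_distance  = mean(d[g == "After"]),
  n_after         = sum(g == "After"),
  sd_after        = sd(d[g == "After"])
)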

Is it possible to include NA for one missing value, and can I verify that each individual corresponds with my specified function

I have code here that generates a random spatial distribution of points, returns a distance column between every point and an infected individual, and uses a function to calculate the probability of infection in the next time step. There are 60 hosts, one of which is infected. I would like to bind the values of Pi (which holds the calculated infection probabilities) to my data frame with the original coordinates. One point, the infected individual, is of course removed from the distance matrix. I would like to replace that value with NA in the main data frame as the next step in my code, and also to confirm that the coordinates correspond with the output of the Pi calculation.
So as it stands I am trying to attach a column of 59 rows to the main data frame of 60 rows.
# Create a spatial distribution with infected individuals
xcoord <- sample(1:100,60)
ycoord <- sample(1:100,60)
infectionstatus <- rep(0,60)
Df <- data.frame(xcoord, ycoord, infectionstatus)
a <- sample(1:60, 1)
Df$infectionstatus[a] <- 1
# Calculate distance between infected individuals and susceptibles
library(rdist)
distances <- pdist(Df[,1:2], metric = "euclidean")
position_infected_individual <- which(Df[,3]==1)
distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]
# Assign parameter values and calculate probability of infection
beta <- 100
alpha <- 0.1
Pi <- numeric(length(distance_from_infected))
for (p in seq_along(distance_from_infected)) {
  Pi[p] <- 1 - exp(-beta * exp(-alpha * distance_from_infected[p]))
}
The obvious step is to replace

distance_from_infected <- distances[-(position_infected_individual), position_infected_individual]

with

distance_from_infected <- c(NA, distances[-(position_infected_individual), position_infected_individual])

But you're setting yourself up for quite a few failures. This assumes (1) that there is only one infected case, (2) that the data frame can always be sorted so the infected individual comes first, and (3) that NA makes "sense" for this kind of numeric summary.
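A somewhat more robust alternative (a sketch only, reusing Df, Pi and position_infected_individual from the question) is to build a full-length probability vector with NA at the infected position, so nothing depends on the infected individual being first:
# Full-length probability vector: NA for the infected individual,
# computed probabilities for everyone else, matched by position.
Pi_full <- rep(NA_real_, nrow(Df))
Pi_full[-position_infected_individual] <- Pi
Df$Pi <- Pi_full

# Quick consistency check: the infected row should be the only NA.
stopifnot(which(is.na(Df$Pi)) == position_infected_individual)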

Generating Random Variables with given correlations between pairs of them

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each normally distributed) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary):
sim <- function(n, probs, corrs) {
  tmp <- normalCopula(corrs, dim=4, "un")
  getSigma(tmp) ## test
  x <- rCopula(n, tmp)
  x2 <- x
  x2[,3:4] <- qbinom(x[,3:4], size=1, prob=rep(probs, each=nrow(x)))
  x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs, targetcorrs, probs, n=1000) {
  cc <- try(cor(sim(n, probs, corrs)), silent=TRUE)
  if (is(cc, "try-error")) return(NA)
  sum((cc[lower.tri(cc)] - targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs, probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs, probs=c(0.5,0.5),
              method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are: (1) we have specified a set of correlations that is not feasible with the marginal distributions we have chosen, or (2) the stochastic error in the objective function surface is getting in the way; to do better we would have to average over more replicates (i.e. increase n).
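As a first check on possibility (1), one can at least verify that the target correlation matrix is positive definite. The sketch below rebuilds the 4x4 matrix from the lower-triangle vector corrs used above; this only rules out infeasibility of the Gaussian copula itself, before the binary margins shrink the attainable correlations further:
# Build the full 4x4 target correlation matrix from the lower-triangle vector.
R <- diag(4)
R[lower.tri(R)] <- corrs
R <- R + t(R) - diag(4)

# All eigenvalues must be positive for a valid correlation matrix.
eigen(R, only.values = TRUE)$values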

K-means algorithm variation with minimum measure of size

I'm looking for an algorithm such as k-means for grouping points on a map into a fixed number of groups, by distance.
The number of groups has already been decided, but the tricky part (at least for me) is to meet the criterion that the sum of MOS within each group should fall in a certain range, say bigger than 1. Is there any way to make that happen?
ID MOS X Y
1 0.47 39.27846 -76.77101
2 0.43 39.22704 -76.70272
3 1.48 39.24719 -76.68485
4 0.15 39.25172 -76.69729
5 0.09 39.24341 -76.69884
I was intrigued by your question but was unsure how you might introduce some sort of random process into a grouping algorithm. It seems that the kmeans algorithm does indeed give different results if you permute your dataset (e.g. the order of the rows). I found this bit of info here. The following script demonstrates this with a random set of data. The plot shows the raw data in black and then draws a segment to the center of each cluster for each permutation (colors).
Since I'm not sure how your MOS variable is defined, I have added a random variable to the data frame to illustrate how you might look for clusterings that satisfy a given criterion. The sum of MOS is calculated for each cluster and the result is stored in the MOS.sums object. To reproduce a favorable clustering, you can use the random seed that was used for that permutation, which is stored in the seeds object. You can see that the permutations result in several different clusterings:
set.seed(33)
nsamples=500
nperms=10
nclusters=3
df <- data.frame(x=runif(nsamples), y=runif(nsamples), MOS=runif(nsamples))
MOS.sums <- matrix(NaN, nrow=nperms, ncol=nclusters)
colnames(MOS.sums) <- paste("cluster", 1:nclusters, sep=".")
rownames(MOS.sums) <- paste("perm", 1:nperms, sep=".")
seeds <- round(runif(nperms, min=1, max=10000))
plot(df$x, df$y)
COL <- rainbow(nperms)
for(i in seq(nperms)){
  set.seed(seeds[i])
  ORD <- sample(nsamples)
  K <- kmeans(df[ORD, 1:2], centers=nclusters)
  MOS.sums[i,] <- tapply(df$MOS[ORD], K$cluster, sum)
  segments(df$x[ORD], df$y[ORD], K$centers[K$cluster, 1], K$centers[K$cluster, 2], col=COL[i])
}
seeds
MOS.sums
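To bring in the questioner's constraint (say, every group's MOS sum bigger than 1), one could then pick a permutation whose cluster sums all satisfy it and reproduce that clustering from the stored seed. A rough sketch, reusing the objects above:
# Rows of MOS.sums where every cluster's MOS total exceeds 1
ok <- which(apply(MOS.sums, 1, function(s) all(s > 1)))

if (length(ok) > 0) {
  # Reproduce the first acceptable clustering from its stored seed
  set.seed(seeds[ok[1]])
  ORD <- sample(nsamples)
  K <- kmeans(df[ORD, 1:2], centers = nclusters)
}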

Random sample from given bivariate discrete distribution

Suppose I have a bivariate discrete distribution, i.e. a table of probability values P(X=i, Y=j), for i=1,...,n and j=1,...,m. How do I generate a random sample (X_k, Y_k), k=1,...,N, from such a distribution? Maybe there is a ready-made R function like:
sample(100, prob=biprob)
where biprob is a 2-dimensional matrix?
One intuitive way to sample is the following. Suppose we have a data.frame
dt=data.frame(X=x,Y=y,P=pij)
where x and y come from
expand.grid(x=1:n,y=1:m)
and pij are the P(X=i,Y=j).
Then we get our sample (Xs, Ys) of size N in the following way:
set.seed(1000)
Xs <- sample(dt$X,size=N,prob=dt$P)
set.seed(1000)
Ys <- sample(dt$Y,size=N,prob=dt$P)
I use set.seed() to simulate the "bivariateness". Intuitively I should get something similar to what I need. I am not sure that this is the correct way, though. Hence the question :)
Another way is to use Gibbs sampling, marginal distributions are easy to compute.
I tried googling, but nothing really relevant came up.
You are almost there. Assuming you have the data frame dt with the x, y, and pij values, just sample the rows!
dt <- expand.grid(X=1:3, Y=1:2)
dt$p <- runif(6)
dt$p <- dt$p / sum(dt$p) # get fake probabilities
idx <- sample(1:nrow(dt), size=8, replace=TRUE, prob=dt$p)
sampled.x <- dt$X[idx]
sampled.y <- dt$Y[idx]
It's not clear to me why you should care that it is bivariate. The probabilities sum to one and the outcomes are discrete, so you are just sampling from a categorical distribution. The only difference is that you are indexing the observations using rows and columns rather than a single position. This is just notation.
In R, you can therefore easily sample from your distribution by reshaping your data and sampling from a categorical distribution. Sampling from a categorical can be done using rmultinom and using which to select the index, or, as Aniko suggests, using sample to sample the rows of the reshaped data. Some bookkeeping can take care of your exact case.
Here's a solution:
library(reshape)
# Reshape data to long format.
data <- matrix(data = c(.25, .5, .1, .4), nrow=2, ncol=2)
pmatrix <- melt(data)
# Sample categorical n times.
rcat <- function(n, pmatrix) {
  rows <- which(rmultinom(n, 1, pmatrix$value) == 1, arr.ind=TRUE)[,'row']
  indices <- pmatrix[rows, c('X1','X2')]
  colnames(indices) <- c('i','j')
  rownames(indices) <- seq(1, nrow(indices))
  return(indices)
}
rcat(3, pmatrix)
This returns 3 random draws from your matrix, reporting the i and j of the rows and columns:
i j
1 1 1
2 2 2
3 2 2
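Either approach can be sanity-checked by drawing a large sample and comparing the empirical joint frequencies against the target probabilities. A quick sketch, assuming the dt data frame built in the earlier answer (expand.grid orders its rows the same way as the table, so the two vectors line up):
# Draw many samples and compare empirical joint frequencies to dt$p
idx2 <- sample(1:nrow(dt), size = 1e5, replace = TRUE, prob = dt$p)
empirical <- table(dt$X[idx2], dt$Y[idx2]) / 1e5
round(cbind(target = dt$p, empirical = as.vector(empirical)), 3)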
