Calculating divergence between joint posterior distributions - r

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored at two 30,000x3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries), rather than obtain an overall value after comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is a R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D<-function(true,
comparison,
burnin,
window.size){
# Bandwidth for theta.1.
my_bw_x<-window.size
# Bandwidth for theta.2.
my_bw_y<-window.size
range_x<-range(c(true[-c(1:burnin),1],comparison[-c(1:burnin),1]))
range_y<-range(c(true[-c(1:burnin),2],comparison[-c(1:burnin),2]))
xx <- seq(range_x[1],range_x[2],by=my_bw_x)
yy <- seq(range_y[1],range_y[2],by=my_bw_y)
true.pointidxs <- matrix( c( findInterval(true[-c(1:burnin),1], xx),
findInterval(true[-c(1:burnin),2], yy) ), ncol=2)
comparison.pointidxs <- matrix( c( findInterval(comparison[-c(1:burnin),1], xx),
findInterval(comparison[-c(1:burnin),2], yy) ), ncol=2)
# Count the frequencies in the corresponding cells:
square.mat.dims <- max(length(xx),nrow=length(yy))
frequencies.true <- frequencies.comparison <- matrix(0, ncol=square.mat.dims, nrow=square.mat.dims)
for (i in 1:dim(true.pointidxs)[1]){
frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] <- frequencies.true[true.pointidxs[i,1],
true.pointidxs[i,2]] + 1
frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] <- frequencies.comparison[comparison.pointidxs[i,1],
comparison.pointidxs[i,2]] + 1
}# End for
# Normalize frequencies matrix:
frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
TVD <-0.5*sum(abs(frequencies.comparison-frequencies.true))
return(TVD)
}# End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)

Related

Creating a point distance component to a monte carlo simulation function in R

I am attempting to do some Monte Carlo simulations, where I have a population of 325 samples in a field. I want to create a list of composite samples (samples consisting of multiple subsamples) from the dataset, while increasing sample size, repeated 100 times. I have created the function that will do so, and have supplied that below in the code.
##Create an example data set
# x and y are coordinates
x <- c(1:100)
y <- rev(c(1:100))
## z and w are soil test values
set.seed(2345)
z <- rnorm(100,mean=50, sd=10)
set.seed(2345)
w <- rnorm(100, mean=75, sd=5)
data <- data.frame(x, y, z, w)
##Initialize list
data.step.sim.list <- list()
## Code that increases sample size
for(i in seq_len(nrow(data))){
thisdat <- replicate(100,data[sample(1:nrow(data), size=i, replace = F),], simplify = F)
data.step.sim.list[[i]] <- thisdat
}
The product becomes a list n long (n being length of dataset), with each list consisting of a list of 100 dataframes (100 coming from 100 replications) that are length 1:n length long.
I have x and y data for each sample as well, and want to stipulate that each subsample collected would be at least 'm' meters from the other samples.
I have created a function that will calculate each distance seen below. I cannot find a way to implement this into my current code. Would anyone know how to do this?
#function to compute distances
calc.dist <- function(x1, y1, x2, y2) {
d <- sqrt(((x2 - x1)^2) + ((y2 - y1)^2))
return(d)
} #end function calc.dist

Computing Spearman's rho for increasing subsets of rows in for Loop

I am trying to fit a for Loop in R in order to run correlations for multiple subsets in a data frame and then store the results in a vector.
What I have in this loop is a data frame with 2 columns, x and y, and 30 rows of different continuous measurement values in each column. The process should be repeated 100 times. The data can be invented.
What I need, is to compute the Spearman's rho for the first five rows (between x and y) and then for increasing subsets (e.g., the sixth first rows, the sevenths first rows etc.). Then, I'd need to store the rho results in a vector that I can further use.
What I had in mind (but does not work):
sortvector <- 1:(30)
for (i in 1:100)
{
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
rho <- cor.test(xtemp,y, method="spearman")$estimate
}
The problem is that the code gives me one value of rho for the whole dataframe, but I need it for increments of subsets.
How can I get rho for subsets of increasing values in a for-loop? And how can i store the coefficients in a vector that i can use afterwards?
Any help would be much appreciated, thanks.
Cheers
The easiest approach is to convertfor loop into sapply function, which returns a vector of rho's as a result of your bootstrapping:
sortvector <- 1:(30)
x <- rnorm(30)
y <- rnorm(30)
rho <- sapply(1:100, function(i) {
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
cor.test(xtemp, y, method = "spearman")$estimate
})
head(rho)
Output:
rho rho rho rho rho rho
0.014460512 -0.239599555 0.003337041 -0.126585095 0.007341491 0.264516129

how do i perform ks test on multiple columns in a matrix?

(a) Generate 1000 samples where each consists of 50 independent exponential random variables with
mean 1. Estimate the mean of each sample. Draw a histogram of the means.
(b) Perform a KS test on each sample against the null hypothesis that they are from an exponential
random variable with a mean that matches the mean of the data set. Draw a histogram of the
1000 values of D.
i did part a with this code
set.seed(0)
simdata = rexp(50000, 1)
matrixdata = matrix(simdata,nrow=50,ncol=1000)
means.exp = apply(matrixdata,2,mean)
means.exp
hist(means.exp)
but im stuck on part (b)
You can use lapply on the column indices:
# KS test on every column
# H0: pexp(rate = 1/mean(column))
lst.ks <- lapply(1:ncol(matrixdata), function(i)
ks.test(matrixdata[, i], "pexp", 1.0/means.exp[i]))
Or directly without having to rely on means.exp:
lst.ks <- lapply(1:ncol(matrixdata), function(i)
ks.test(matrixdata[, i], "pexp", 1.0/mean(matrixdata[, i])))
Here 1.0/means.exp[i] corresponds to the rate of the exponential distribution.
PS. Using means.exp = colMeans(matrixdata) is faster than apply(matrixdata, 2, mean), see e.g. here for a relevant SO post.
To extract the test statistic and store it in a vector simply sapply over the KS test results:
# Extract test statistic as vector
Dstat <- sapply(lst.ks, function(x) x$statistic);
# (gg)plot Dstat
ggplot(data.frame(D = Dstat), aes(D)) + geom_histogram(bins = 30);

How to solve equations symbolically in a loop in R?

I need to conduct Gaussian Maximum Likelihood Classification for 1000 data sets of two classes of bivariate Gaussian distributions with each 100 data points.
Here is the code to create the data sets:
# mean vector for two classes
mean1<-c(70,130) ; mean2<-c(148,160)
# covariance matrix for two classes
cov1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
cov2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(MASS)
# Number of samples
nrs <- 1000
# sample size
ss <- 100
# number of dimensions
d <- length(mean1)
set.seed(1)
# generation of bivariate normal random variables based on mean vector and covariance matrix for each class
refdata_1 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean1, Sigma = cov1 ),ncol = d,nrow = ss),simplify=FALSE)
refdata_2 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean2, Sigma = cov2 ),ncol = d,nrow = ss),simplify=FALSE)
# calculation of mean vector for each sample of random reference data
mean_refdata_1 <- lapply(refdata_1,colMeans)
mean_refdata_2 <- lapply(refdata_2,colMeans)
# calculation of covariance matrix for each sample of random reference data
cov_refdata_1 <- lapply(refdata_1,cov)
cov_refdata_2 <- lapply(refdata_2,cov)
Now, I need to plot the decision boundary between the two classes for each of the 1000 data sets (thus 1000 decision boundaries).
Here is the decision equation (if you wonder where the ln p(class) part is, both classes have same probability and thus cancel each other out):
This is the vector of the data points:
x = vector(SR,var('a,b'))
Here is the decision equation (if you wonder where the ln p(class) part is, both classes have same probability and thus cancel each other out):
decision1 =-0.5*log(det(cov1))-0.5*((x-mean1)*cov1.inverse()*(x-mean1))
decision2 =-0.5*log(det(cov2))-0.5*((x-mean2)*cov2.inverse()*(x-mean2))
If decision1(data point) > decision2(data point), then the data point belongs to class 1. In order to get the decision boundary, decision1 - decision2 == 0. The data points are RBG images. Thus, a in the data vector x is 0:255. I solve the equation for b:
solve(decision1-decision2==0,b)
In R, that looks for the original data set like this:
m_1<-c(70,130) ; m_2<-c(148,160)
covma_1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
covma_2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(rSymPy)
c11 <- Var("c11")
c12 <- Var("c12")
c13 <- Var("c13")
c14 <- Var("c14")
sympy("covma_1 = Matrix([[c11,c12], [c13,c14]])")
a <- Var("a")
b <- Var("b")
sympy("x = Matrix([a,b])")
m11 <- Var("m11")
m12 <- Var("m12")
sympy("m_1 = Matrix([m11,m12])")
sympy("covma_1=covma_1.subs(c11,784)")
sympy("covma_1=covma_1.subs(c12,-546)")
sympy("covma_1=covma_1.subs(c13,-546)")
sympy("covma_1=covma_1.subs(c14,900)")
sympy("m_1=m_1 .subs(m11,70)")
sympy("m_1=m_1 .subs(m12,130)")
first <-sympy("-0.5*log(covma_1.det())")
second <-sympy("-0.5*((x-m_1).T*covma_1.inv()*(x-m_1))")
second<-gsub("\\[","",second)
second<-gsub("\\]","",second)
c21 <- Var("c21")
c22 <- Var("c22")
c23 <- Var("c23")
c24 <- Var("c24")
sympy("covma_2 = Matrix([[c21,c22], [c23,c24]])")
m21 <- Var("m21")
m22 <- Var("m22")
sympy("m_2 = Matrix([m21,m22])")
sympy("covma_2=covma_2.subs(c21,484)")
sympy("covma_2=covma_2.subs(c22,285.1)")
sympy("covma_2=covma_2.subs(c23,285.1)")
sympy("covma_2=covma_2.subs(c24,324)")
sympy("m_2=m_2.subs(m21,148)")
sympy("m_2=m_2.subs(m22,160)")
third <-sympy("-0.5*log(covma_2.det())")
fourth <-sympy("-0.5*((x-m_2).T*covma_2.inv()*(x-m_2))")
fourth<-gsub("\\[","",fourth)
fourth<-gsub("\\]","",fourth)
class1 <- paste(c(first,second),collapse="")
class2 <- paste(c(third,fourth),collapse="")
sympy(paste(c("hm=solve(",class2,"-","(",class1,")",",b)"), collapse = ""))
As you can see, I use very nasty string operations to parse into sympy. Anyway, after solving for b in sympy, I stuck and don't know how to get numeric values. Can somebody tell me how to solve symbolically for b and plot it in a loop for 1000 data sets? I m also open for non-symbolic approaches.
Any help is appreciated!

Saving output from for-loop to 3D array in R

I am working in R to save outputs from a 'for' loop in to a 3D matrix. I have been unable to adapt a similar example answered here for my purposes, so I'd like to share a different example.
I have a mostly-completed "for" loop that generates slopes and intercepts from a linear model for N iterations; with each iteration using a new set of y-values with a random t-distribution ('rt').
The desired resulting output is a 3D matrix with two slices, here named "out2". One slice is named "Intercept" and the other is "Slope." Each column in both of the sheets is a result from the model generated with different degrees of dreedom (dfs)
set.seed(14)
x <- sample(0:50, 15) # Generate x-values for simulation
true.a <- 1.5 # Intercept for linear relationship
true.m <- 5 # Slope for linear relationship
dfs <- c(1,2,3,4,6,8,10,15,20,25) # Degrees of freedom
N <- 1000 # Reps in for-loop
out2 <- array(NA, dim=c(N, length(dfs), 2))
dimnames(out2) <- list(NULL, dfs, c("Intercept", "Slope"))
for(j in 1:length(dfs)) {
df.tdist <- dfs[j]
for(i in 1:N) {
y <- true.a + true.m * x + 25*rt(15,df.tdist)
fit <- lm(y ~ x)
out2[ ] <- ?????????????
# The output array 'out2' will consist of two "slices", one with intercepts
and one with slopes. The length of each slice is 1000 rows, and the
width of each slice is 10 columns
}
}
Thanks greatly in advance for your feedback.

Resources