How to solve equations symbolically in a loop in R? - r

I need to conduct Gaussian Maximum Likelihood Classification for 1000 data sets of two classes of bivariate Gaussian distributions, with 100 data points each.
Here is the code to create the data sets:
# mean vector for two classes
mean1<-c(70,130) ; mean2<-c(148,160)
# covariance matrix for two classes
cov1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
cov2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(MASS)
# Number of samples
nrs <- 1000
# sample size
ss <- 100
# number of dimensions
d <- length(mean1)
set.seed(1)
# generation of bivariate normal random variables based on mean vector and covariance matrix for each class
refdata_1 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean1, Sigma = cov1 ),ncol = d,nrow = ss),simplify=FALSE)
refdata_2 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean2, Sigma = cov2 ),ncol = d,nrow = ss),simplify=FALSE)
# calculation of mean vector for each sample of random reference data
mean_refdata_1 <- lapply(refdata_1,colMeans)
mean_refdata_2 <- lapply(refdata_2,colMeans)
# calculation of covariance matrix for each sample of random reference data
cov_refdata_1 <- lapply(refdata_1,cov)
cov_refdata_2 <- lapply(refdata_2,cov)
Now, I need to plot the decision boundary between the two classes for each of the 1000 data sets (thus 1000 decision boundaries).
This is the vector of the data points:
x = vector(SR,var('a,b'))
Here is the decision equation (if you wonder where the ln p(class) part is, both classes have the same probability, so those terms cancel out):
decision1 =-0.5*log(det(cov1))-0.5*((x-mean1)*cov1.inverse()*(x-mean1))
decision2 =-0.5*log(det(cov2))-0.5*((x-mean2)*cov2.inverse()*(x-mean2))
If decision1(data point) > decision2(data point), then the data point belongs to class 1. To get the decision boundary, set decision1 - decision2 == 0. The data points are RGB images, so a in the data vector x ranges over 0:255. I solve the equation for b:
solve(decision1-decision2==0,b)
In R, that looks like this for the original data set:
m_1<-c(70,130) ; m_2<-c(148,160)
covma_1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
covma_2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(rSymPy)
c11 <- Var("c11")
c12 <- Var("c12")
c13 <- Var("c13")
c14 <- Var("c14")
sympy("covma_1 = Matrix([[c11,c12], [c13,c14]])")
a <- Var("a")
b <- Var("b")
sympy("x = Matrix([a,b])")
m11 <- Var("m11")
m12 <- Var("m12")
sympy("m_1 = Matrix([m11,m12])")
sympy("covma_1=covma_1.subs(c11,784)")
sympy("covma_1=covma_1.subs(c12,-546)")
sympy("covma_1=covma_1.subs(c13,-546)")
sympy("covma_1=covma_1.subs(c14,900)")
sympy("m_1=m_1 .subs(m11,70)")
sympy("m_1=m_1 .subs(m12,130)")
first <-sympy("-0.5*log(covma_1.det())")
second <-sympy("-0.5*((x-m_1).T*covma_1.inv()*(x-m_1))")
second<-gsub("\\[","",second)
second<-gsub("\\]","",second)
c21 <- Var("c21")
c22 <- Var("c22")
c23 <- Var("c23")
c24 <- Var("c24")
sympy("covma_2 = Matrix([[c21,c22], [c23,c24]])")
m21 <- Var("m21")
m22 <- Var("m22")
sympy("m_2 = Matrix([m21,m22])")
sympy("covma_2=covma_2.subs(c21,484)")
sympy("covma_2=covma_2.subs(c22,285.1)")
sympy("covma_2=covma_2.subs(c23,285.1)")
sympy("covma_2=covma_2.subs(c24,324)")
sympy("m_2=m_2.subs(m21,148)")
sympy("m_2=m_2.subs(m22,160)")
third <-sympy("-0.5*log(covma_2.det())")
fourth <-sympy("-0.5*((x-m_2).T*covma_2.inv()*(x-m_2))")
fourth<-gsub("\\[","",fourth)
fourth<-gsub("\\]","",fourth)
class1 <- paste(c(first,second),collapse="")
class2 <- paste(c(third,fourth),collapse="")
sympy(paste(c("hm=solve(",class2,"-","(",class1,")",",b)"), collapse = ""))
As you can see, I use very nasty string operations to pass the expressions into sympy. Anyway, after solving for b in sympy, I'm stuck and don't know how to get numeric values out of the result. Can somebody tell me how to solve symbolically for b and plot the boundary in a loop over the 1000 data sets? I'm also open to non-symbolic approaches.
Any help is appreciated!
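One non-symbolic route, sketched below using the question's own objects (mean_refdata_1, cov_refdata_1, etc.); the helper name draw_boundary is mine and the code is a sketch, not a tested solution. Since the boundary is simply the zero set of decision1 - decision2, you can evaluate that difference on a grid of (a, b) values in 0:255 and draw its zero contour with contour(), once per simulated sample, with no sympy involved:
# Sketch: draw the decision boundary numerically as the zero contour of the
# difference of the two quadratic discriminant functions.
# draw_boundary is a hypothetical helper, not part of the original code.
draw_boundary <- function(mu1, S1, mu2, S2,
                          a = seq(0, 255, length.out = 200),
                          b = seq(0, 255, length.out = 200), add = FALSE) {
  disc <- function(p, mu, S) {   # quadratic discriminant for one point p = c(a, b)
    d <- p - mu
    -0.5 * log(det(S)) - 0.5 * sum(d * (solve(S) %*% d))
  }
  grid <- expand.grid(a = a, b = b)
  z <- apply(grid, 1, function(p) disc(p, mu1, S1) - disc(p, mu2, S2))
  contour(a, b, matrix(z, nrow = length(a)), levels = 0,
          drawlabels = FALSE, add = add)
}
# Boundaries for the first 10 simulated samples (use 1:nrs for all 1000):
plot(NA, xlim = c(0, 255), ylim = c(0, 255), xlab = "a", ylab = "b")
for (i in 1:10) {
  draw_boundary(mean_refdata_1[[i]], cov_refdata_1[[i]],
                mean_refdata_2[[i]], cov_refdata_2[[i]], add = TRUE)
}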

Related

Creating a point distance component to a monte carlo simulation function in R

I am attempting to do some Monte Carlo simulations, where I have a population of 325 samples in a field. I want to create a list of composite samples (samples consisting of multiple subsamples) from the dataset, while increasing sample size, repeated 100 times. I have created the function that will do so, and have supplied that below in the code.
##Create an example data set
# x and y are coordinates
x <- c(1:100)
y <- rev(c(1:100))
## z and w are soil test values
set.seed(2345)
z <- rnorm(100,mean=50, sd=10)
set.seed(2345)
w <- rnorm(100, mean=75, sd=5)
data <- data.frame(x, y, z, w)
##Initialize list
data.step.sim.list <- list()
## Code that increases sample size
for (i in seq_len(nrow(data))) {
  thisdat <- replicate(100, data[sample(1:nrow(data), size = i, replace = FALSE), ], simplify = FALSE)
  data.step.sim.list[[i]] <- thisdat
}
The result is a list of length n (n being the number of rows in the dataset), where element i is itself a list of 100 data frames (one per replication), each containing i sampled rows.
I have x and y data for each sample as well, and want to stipulate that each subsample collected would be at least 'm' meters from the other samples.
I have created a function that calculates the distance between two points, shown below, but I cannot find a way to work this constraint into my current code. Would anyone know how to do this?
#function to compute distances
calc.dist <- function(x1, y1, x2, y2) {
  d <- sqrt(((x2 - x1)^2) + ((y2 - y1)^2))
  return(d)
} # end function calc.dist
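One possible way to add the distance constraint (a sketch, not a tested solution): build each composite sample greedily, drawing candidate rows one at a time and keeping a candidate only if it is at least m units from every row already selected. select_spaced below is a hypothetical helper that reuses calc.dist.
# Sketch: greedy selection of 'size' rows, each at least m units from the
# rows already chosen; returns fewer rows if no valid configuration is found
# within max.tries attempts.
select_spaced <- function(data, size, m, max.tries = 1000) {
  chosen <- sample(nrow(data), 1)   # start from one random row
  tries <- 0
  while (length(chosen) < size && tries < max.tries) {
    remaining <- setdiff(seq_len(nrow(data)), chosen)
    cand <- remaining[sample(length(remaining), 1)]
    dists <- calc.dist(data$x[cand], data$y[cand], data$x[chosen], data$y[chosen])
    if (all(dists >= m)) chosen <- c(chosen, cand)
    tries <- tries + 1
  }
  data[chosen, ]
}
# Possible drop-in replacement for the sampling step inside the existing loop:
# thisdat <- replicate(100, select_spaced(data, size = i, m = 5), simplify = FALSE)
Note that for large i and large m there may be no valid configuration at all, which is why the max.tries guard returns fewer rows instead of looping forever.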

How to generate data for gaussian distributions in these 2 scenarios in R?

In "Elements of Statistical Learning" by Tibshirani, when comparing least squares/linear models and knn these 2 scenarios are stated:
Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.
The idea is that the first is better suited for least squares/linear models and the second for knn-like models (those with higher variance, from what I understand, since knn takes into account only the closest points rather than all points).
In R, how would I simulate data for both scenarios?
The end goal is to reproduce both scenarios in order to show that the first is indeed better explained by the linear model than the second.
Thanks!
This could be scenario 1
library(mvtnorm)
N1 = 50
N2 = 50
K = 2
mu1 = c(-1,3)
mu2 = c(2,0)
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11,cov1,cov1,v12),nrow=2)
cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21,cov2,cov2,v22),nrow=2)
x1 = rmvnorm(N1,mu1,Sigma1)
x2 = rmvnorm(N2,mu2,Sigma2)
This could be a candidate for simulating from a Gaussian mixture:
BartSimpson <- function(x, n = 100){
  means <- as.matrix(sort(rnorm(10)))
  dens <- .1*rowSums(apply(means, 1, dnorm, x = x, sd = .1))
  rBartSimpson <- c(apply(means, 1, rnorm, n = n/10, sd = .1))
  return(list("thedensity" = dens, "draws" = rBartSimpson))
}
x <- seq(-5,5,by=.01)
plot(x,BartSimpson(x)$thedensity,type="l",lwd=4,col="yellow2",xlim=c(-4,4),ylim=c(0,0.6))
In the code below, I first create the 10 different means of the classes, and then use the means to draw random values from those means. The code is identical for the two scenarios, but you'll have to adjust the variance within and between classes to get the results you want.
Scenario 1:
Here you want to generate 10 classes with different means (I assume the means follow a bivariate Gaussian distribution). The difference between classes is much less than the difference within classes.
library(MASS)
n <- 20             # subjects per class
classes <- 10       # number of classes
mean <- 100         # mean value for all classes
var.between <- 25   # variation between classes
var.within <- 225   # variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
  temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
  class <- c(class, rep(i, n))
  values <- c(values, temp)
}
# this loop generates data for each class based on the class means and the variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame(class, valuematrix)
plot(data$X1, data$X2)
Alternatively, if you don't care about specifying the variance between the classes, and you don't want any correlation within classes, you can just do this:
covmatrix <- matrix(c(225, 0, 0, 225), nrow=2)
# specifies that the variance in both groups is 225 and no covariance
values <- matrix(mvrnorm(200, c(100,100), Sigma=covmatrix), nrow=200)
# creates a matrix of 200 individuals with two values each.
Scenario 2:
Here the only difference is that the variation between classes is larger than the variation within classes. Try changing the variable var.between to around 500 and var.within to 25 and you will see a clear clustering in the scatterplot:
n <- 20             # subjects per class
classes <- 10       # number of classes
mean <- 100         # mean value for all classes
var.between <- 500  # variation between classes
var.within <- 25    # variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
  temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
  class <- c(class, rep(i, n))
  values <- c(values, temp)
}
# this loop generates data for each class based on the class means and the variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame(class, valuematrix)
plot(data$X1, data$X2)
The plot should confirm that the data are clustered.
Hope this helps!
With the help from both answers here I ended up using this:
mixed_dists = function(n, n_means, var=0.2) {
  means = rnorm(n_means, mean=1, sd=2)
  values <- NULL
  class <- NULL
  for (i in 1:n_means) {
    # use the var argument (treated as a variance); it was previously ignored in favour of a hard-coded sd of 0.2
    temp <- rnorm(n/n_means, mean=means[i], sd=sqrt(var))
    class <- c(class, rep(i, n/n_means))
    values <- c(values, temp)
  }
  return(list(values, class))
}
N = 100
#Scenario 1: The training data in each class were generated from bivariate Gaussian distributions
#with uncorrelated components and different means.
scenario1 = function () {
  var = 0.5
  n_groups = 2
  m = mixed_dists(N, n_groups, var=var)
  x = m[[1]]
  group = m[[2]]
  y = mixed_dists(N, n_groups, var=var)[[1]]
  data = matrix(c(x, y, group), nrow=N, ncol=3)
  colnames(data) = c("x", "y", "group")
  data = data.frame(data)
  plot(x=data$x, y=data$y, col=data$group)
  model = lm(y~x, data=data)
  summary(model)
}
#Scenario 2: The training data in each class came from a mixture of 10
#low-variance Gaussian distributions, with individual means themselves
#distributed as Gaussian.
scenario2 = function () {
  var = 0.2 # low variance
  n_groups = 10
  m = mixed_dists(N, n_groups, var=var)
  x = m[[1]]
  group = m[[2]]
  y = mixed_dists(N, n_groups, var=var)[[1]]
  data = matrix(c(x, y, group), nrow=N, ncol=3)
  colnames(data) = c("x", "y", "group")
  data = data.frame(data)
  plot(x=data$x, y=data$y, col=data$group)
  model = lm(y~x, data=data)
  summary(model)
}
# scenario1()
# scenario2()
So basically the data in scenario 1 are cleanly separated into 2 classes, while the data in scenario 2 form about 10 clusters and cannot be cleanly separated by a straight line. Indeed, running the linear model on both scenarios shows that, on average, it fits scenario 1 better than scenario 2.

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored in two 30,000x3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries); rather, I want an overall value obtained by comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D <- function(true, comparison, burnin, window.size){
  # Bandwidth for theta.1.
  my_bw_x <- window.size
  # Bandwidth for theta.2.
  my_bw_y <- window.size
  range_x <- range(c(true[-c(1:burnin),1], comparison[-c(1:burnin),1]))
  range_y <- range(c(true[-c(1:burnin),2], comparison[-c(1:burnin),2]))
  xx <- seq(range_x[1], range_x[2], by=my_bw_x)
  yy <- seq(range_y[1], range_y[2], by=my_bw_y)
  true.pointidxs <- matrix(c(findInterval(true[-c(1:burnin),1], xx),
                             findInterval(true[-c(1:burnin),2], yy)), ncol=2)
  comparison.pointidxs <- matrix(c(findInterval(comparison[-c(1:burnin),1], xx),
                                   findInterval(comparison[-c(1:burnin),2], yy)), ncol=2)
  # Count the frequencies in the corresponding cells:
  square.mat.dims <- max(length(xx), length(yy))
  frequencies.true <- frequencies.comparison <- matrix(0, ncol=square.mat.dims, nrow=square.mat.dims)
  for (i in 1:dim(true.pointidxs)[1]){
    frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] <-
      frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] + 1
    frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] <-
      frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] + 1
  } # End for
  # Normalize the frequencies matrices:
  frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
  frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
  TVD <- 0.5*sum(abs(frequencies.comparison - frequencies.true))
  return(TVD)
} # End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)
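For the 3-dimensional case the question asks about, one option is to generalise the same binning idea: cut each of the 3 columns with findInterval, label each draw by its cell, tabulate the cell frequencies of both samples, and take half the sum of absolute differences. The sketch below makes that explicit; tvd_nd is a hypothetical helper and has not been checked against the 2D function above.
# Sketch: total variation distance between two samples of equal dimension,
# estimated by binning each dimension and comparing cell frequencies.
tvd_nd <- function(true, comparison, burnin = 0, window.size = 0.1) {
  if (burnin > 0) {
    true <- true[-seq_len(burnin), , drop = FALSE]
    comparison <- comparison[-seq_len(burnin), , drop = FALSE]
  }
  d <- ncol(true)
  # common bin breakpoints for each dimension
  breaks <- lapply(seq_len(d), function(j) {
    r <- range(c(true[, j], comparison[, j]))
    seq(r[1], r[2], by = window.size)
  })
  # map every draw to a single cell label such as "12_4_7"
  cell <- function(m) {
    idx <- sapply(seq_len(d), function(j) findInterval(m[, j], breaks[[j]]))
    apply(idx, 1, paste, collapse = "_")
  }
  p <- table(cell(true)) / nrow(true)
  q <- table(cell(comparison)) / nrow(comparison)
  cells <- union(names(p), names(q))
  pv <- as.numeric(p[cells]); pv[is.na(pv)] <- 0
  qv <- as.numeric(q[cells]); qv[is.na(qv)] <- 0
  0.5 * sum(abs(pv - qv))
}
# e.g. (with hypothetical 30,000x3 draw matrices):
# tvd_nd(draws.A, draws.B, burnin = 10000, window.size = 0.25)
Keep in mind that in 3 or more dimensions the cells fill up much more slowly, so the bin width usually has to be wider than in 2D for the estimate to be stable.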

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example (as I haven't explained that very well), Z(42,3) is the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I found.
However, I want to find a way to estimate all 12 parameters at once (b, z0 and the 10 values of F(w)), since b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind,lapply(w,function(w)
z0*F[w+1]*(1-exp(-b*u/F[w+1]))+rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format - the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with a separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
Now we have a data frame df with columns u, w, and z. The nls.lm(...) function takes (at least) 4 arguments: par is a vector of initial estimates of the parameters of the fit, fn is a function that calculates the residuals at each step, observed is the dependent variable (z), and xx is a vector or matrix containing the independent variables (u, w).
Next we define a function, f(par, xx), where par is an 11 element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9. This is because you state that F(0) is known to be 1. xx is a matrix with two columns: the values for u and w respectively. f(par,xx) then calculates estimate of the response z for all values of u and w, for the given parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
z0 <- pars[1]
b <- pars[2]
F <- c(1,pars[3:11])
u <- xx[,1]
w <- xx[,2]
z <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
fn = resids, observed = df$z, xx = df[,c("u","w")],
control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df,aes(x=u, color=factor(w)))+
geom_point(aes(y=z))+ geom_line(aes(y=pred))

Permutations of correlation coefficients

My question is on the permutation of correlation coefficients.
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
I obtained this correlation and then wanted to perform permutation tests to check if the correlation still holds.
I did the permutation as follows:
A<-as.vector(t(A))
B<-as.vector(t(B))
corperm <- function(A,B,1000) {
# n is the number of permutations
# x and y are the vectors to correlate
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
}
The result was
[1] 0.645
and using "cor.test"
cor.test(A,B)
Pearson's product-moment correlation
data: A and B
t = 0.4753, df = 13, p-value = 0.6425
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4089539 0.6026075
sample estimates:
cor
0.1306868
How could I draw a plot or a histogram showing the actual correlation and the permuted correlation values from the permuted data?
First of all, you can't have done it exactly this way, as ...
> corperm = function(A,B,1000) {
Error: unexpected numeric constant in "corperm = function(A,B,1000"
The third argument has no name but it should have one! Perhaps you meant
> corperm <- function(A, B, n=1000) {
# etc
Then you need to think about what you want to achieve. Initially you have two data sets with 3 variables each, and then you collapse them into two vectors and compute a correlation between the permuted vectors. Why does that make sense? The structure of the permuted data set should be the same as that of the original data set.
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
Why do you use replace=TRUE here? This makes sense if you would like bootstrap CIs, but (a) it would be better to use a dedicated function then, e.g. boot from the boot package, and (b) you would need to do the same with B, i.e. sample(B, replace=TRUE).
For a permutation test you sample without replacement, and it makes no difference whether you permute both A and B or only A.
And how to get the histogram? Well, hist(tmp) would draw you a histogram of the permuted values, and obs is the absolute value of the observed correlation.
HTHAB
(edit)
corperm <- function(x, y, N=1000, plot=FALSE){
  reps <- replicate(N, cor(sample(x), y))
  obs <- cor(x, y)
  p <- mean(reps > obs) # shortcut for sum(reps > obs)/N
  if(plot){
    hist(reps)
    abline(v=obs, col="red")
  }
  p
}
Now you can use this on a single pair of variables:
corperm(A[,1], B[,1])
To apply it to all pairs, use for or mapply. for is easier to understand, so I wouldn't insist on using mapply to get all possible pairs.
res <- matrix(NA, nrow=NCOL(A), ncol=NCOL(B))
for(iii in 1:3) for(jjj in 1:3) res[iii,jjj] <- corperm(A[,iii], B[,jjj], plot=FALSE)
rownames(res)<-names(A)
colnames(res) <- names(B)
print(res)
To make all histograms, use plot=TRUE above.
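For completeness, the mapply route mentioned above might look like this (a sketch, assuming A and B are still the original 3-column data frames rather than the collapsed vectors):
# build all column-index pairs, apply corperm to each, reshape into a matrix
pairs <- expand.grid(i = seq_len(ncol(A)), j = seq_len(ncol(B)))
res2 <- mapply(function(i, j) corperm(A[, i], B[, j]), pairs$i, pairs$j)
res2 <- matrix(res2, nrow = ncol(A), dimnames = list(names(A), names(B)))
print(res2)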
I think there is not much point in doing a permutation test for the correlation of two variables, because the cor.test() function already provides a p-value, which serves the same purpose as a permutation test.
