Why is Adam optimization unable to converge in linear regression? - r

I am studying the Adam optimizer with a toy problem. In R, I generate some artificial data:
Y = c0 + c1 * x1 + c2 * x2 + noise
In this equation, x1, x2 and noise are normal random numbers generated in R, and theta = [c0, c1, c2] is the parameter vector I try to estimate with the Adam optimizer. For this simple regression problem I can also determine theta analytically; that solution is k in my R code below.
For the Adam algorithm, I use the formulae from this site:
Overview: Adam
I vary the step size eta in this parametric study. The final theta from the Adam algorithm is not the same as the analytical solution k.
I have checked my code many times and run it line by line, but I cannot understand why the Adam algorithm does not converge.
Added:
I changed the algorithm to AMSGrad. It performs better than Adam in this case. However, AMSGrad does not converge either.
rm(list = ls())
n=500
x1=rnorm(n,mean=6,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(intercept=rep(1,n),x1,x2))
Y=-20+51*x1-15*x2+rnorm(n,mean=0,sd=2);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=4000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.04 # step size
beta1=0.9
beta2=0.999
t1=integer(iterations)
t2=matrix(0,iterations,3)
t3=integer(iterations)
epsilon1=1E-8 # small number defined for numerical computation
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(intercept=rep(1,N),X))# add a column of N ones to represent intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=2),1,np) # Initialize theta:1 x np matrix
m_i=matrix(0,1,np) # initialization, zero vector
v_i=matrix(0,1,np) # initialization, zero vector
for(i in 1:iterations){
error=theta%*%t(X)-t(Y) # error = (theta * x' -Y'). Error is a 1xN row vector;
grad=1/N*error%*%X # Gradient grad is 1 x np vector
m_i=beta1*m_i+(1-beta1)*grad # moving average of gradient, 1 x np vector
v_i=beta2*v_i+(1-beta2)*grad^2 # moving average of squared gradients, 1 x np vector
# corrected moving averages
m_corrected=m_i/(1-beta1^i)
v_corrected=v_i/(1-beta2^i)
d_theta=eta/(sqrt(v_corrected)+epsilon1)*m_corrected
theta=theta-d_theta
L=sqrt(sum((d_theta)^2)) # calculating the L2 norm
t1[i]=L # record the L2 norm in each step
if ((is.infinite(L))||(is.nan(L))) {
print("Learning rate is too large. Lowering the rate may help.")
break
}
else if (L<=epsilon) {
print("Algorithm convergence is reached.")
break # checking whether convergence is obtained or not
}
# if (i==1){
# browser()
# }
}
plot(t1,type="l",ylab="norm",lwd=3,col=rgb(0,0,1))
k
theta

Related

find total variation distance between multinomial distributions in r

I am comparing Bayes estimators to MLE in multinomial distributions. I am drawing random samples using rmultinom from a particular multinomial distribution using
rmultinom(400, size = 30, prob = c(5,7,10,8,14,10,15,12,10,9))
For each of the 400 samples, I compute the MLE and Bayes estimators for the ten probability parameters. I now want to find in each case the total variation distance between the true distribution and the one defined by the estimators.
Since for size 30 and 10 bins there are over 200 million possible arrangements, I don't think that using the theoretical definition is a good idea.
The package distrEx has a function "TotalVarDist()", but it can only be used with distributions defined in the distr package, and multinomial is not one of them. There are directions for defining them (see here and here) but the options are either to define a discrete distribution by explicitly listing the support (again, I don't think this is a good option since the support has a size of over 200 million) or starting from scratch using the same methods as how the distr package was created, which is beyond my current ability.
Any thoughts on how to do this, either using the packages mentioned or in a completely different way?
My answer is about how to calculate this using base R.
We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.
I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.
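As a sanity check on the definition, here is a minimal sketch on a case small enough to enumerate: 3 bins and size 3. The parameter vectors theta and eta below are made-up values for illustration, not taken from the question. It compares P_θ(E) - P_η(E) with the usual half-L1 form of the total variation distance.

```r
size  <- 3
theta <- c(0.2, 0.3, 0.5)   # assumed values, for illustration only
eta   <- c(0.4, 0.4, 0.2)   # assumed values, for illustration only

# Enumerate every count vector (n1, n2, n3) with n1 + n2 + n3 = size
grid   <- expand.grid(n1 = 0:size, n2 = 0:size)
grid   <- grid[grid$n1 + grid$n2 <= size, ]
counts <- cbind(grid$n1, grid$n2, size - grid$n1 - grid$n2)

# Probability mass of each outcome under both distributions
p_theta <- apply(counts, 1, dmultinom, size = size, prob = theta)
p_eta   <- apply(counts, 1, dmultinom, size = size, prob = eta)

# E is the set of outcomes that are more likely under theta
E <- p_theta > p_eta
tv_event   <- sum(p_theta[E]) - sum(p_eta[E])
tv_half_l1 <- 0.5 * sum(abs(p_theta - p_eta))
c(tv_event, tv_half_l1)   # the two forms should agree
```

This brute-force enumeration is only feasible for tiny problems, which is why the two methods below avoid it.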
Simulation-based method
You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:
unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9)
true <- unnormalized.true / sum(unnormalized.true)
We'll draw a sample and estimate a new distribution using a Bayes estimator:
set.seed(921)
result <- as.vector(rmultinom(1, size = 30, prob = true))
result
## [1] 3 6 2 0 5 3 3 4 1 3
dirichlet <- (result+1)/(30+length(true))
Calculating the probability of E under the true distribution:
set.seed(939)
true.dist <- rmultinom(10^6, 30, true)
p.true.e <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
Calculating the probability of E under the estimated distribution from the Bayes estimator:
dirichlet.dist <- rmultinom(10^6, 30, dirichlet)
p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
And we can subtract to get the total variation distance.
p.true.e - p.dirichlet.e
## [1] 0.83737
Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.
mle <- result/30
mle.dist <- rmultinom(10^6, 30, mle)
p.true.e2 <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.mle.e2 <- mean(apply(mle.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.true.e2 - p.mle.e2
## [1] 0.968301
(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)
Normal approximation
I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.
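The following derivation yields a restatement of E. The multinomial coefficient is the same under both distributions and cancels, so
$$P_\theta(\{\omega\}) > P_\eta(\{\omega\}) \iff \prod_i \theta_i^{N_i} > \prod_i \eta_i^{N_i} \iff \sum_i N_i \log\frac{\theta_i}{\eta_i} > 0.$$
Define
$$L = \sum_i \lambda_i N_i,$$
where N_i is one cell of the multinomial sample, and
$$\lambda_i = \log\frac{\theta_i}{\eta_i}.$$
Then, E is the event that L>0.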
The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.
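A quick way to convince yourself that "L>0" really is the same event as E: the sketch below redoes the setup from the example (true distribution and the Bayes estimate from the sample above) with my own variable names, draws some samples, and checks membership in E both ways. The number of draws (1000) and the seed are arbitrary choices.

```r
set.seed(1)
theta <- c(5, 7, 10, 8, 14, 10, 15, 12, 10, 9) / 100      # true distribution
eta   <- (c(3, 6, 2, 0, 5, 3, 3, 4, 1, 3) + 1) / (30 + 10) # Bayes estimate from the sample above
lambda <- log(theta / eta)

draws <- rmultinom(1000, 30, theta)  # one sample per column

# Membership in E, directly from the probability mass functions
in_E_pmf <- apply(draws, 2, function(x)
  dmultinom(x, 30, theta) > dmultinom(x, 30, eta))

# Membership in E via the linear combination L = sum(lambda * counts)
in_E_L <- as.vector(lambda %*% draws) > 0

all(in_E_pmf == in_E_L)
```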
I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:
lambda <- log(true/dirichlet)
L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x))
L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))
Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.
Plotting the distribution of L and comparing to a normal fit:
par(mfrow=c(1,2))
L.true.dist.hist <- hist(L.true.dist)
L.true.dist.fit <- function(x)
length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *
dnorm(x, mean(L.true.dist), sd=sd(L.true.dist))
curve(L.true.dist.fit, add=TRUE, n=1000, col='red')
L.dirichlet.dist.hist <- hist(L.dirichlet.dist)
L.dirichlet.dist.fit <- function(x)
length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *
dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist))
curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red')
par(mfrow=c(1,1))
The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.
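The mean of L is
$$E[L] = n \sum_i \lambda_i p_i,$$
where p_i is the cell probability of cell i in the distribution p. The variance is
$$\mathrm{Var}(L) = \lambda^\top \Sigma \lambda,$$
where
$$\Sigma_{ij} = \begin{cases} n p_i (1 - p_i), & i = j \\ -n p_i p_j, & i \neq j \end{cases}$$
is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution: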
n <- 30
k <- length(true)
mean.L.true <- sum(lambda * n * true)
# Did we get the mean right?
c(mean.L.true, mean(L.true.dist))
## [1] 3.873509 3.875547
# Covariance matrix assuming the true distribution
sigma.true <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j]))
var.L.true <- t(lambda) %*% sigma.true %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.true), sd(L.true.dist))
## [1] 2.777787 2.776945
Then, the mean and variance of L under the Bayes estimate of the distribution:
mean.L.dirichlet <- sum(lambda * n * dirichlet)
# Did we get the mean right?
c(mean.L.dirichlet, mean(L.dirichlet.dist))
## [1] -3.893836 -3.895983
# Covariance matrix assuming the estimated distribution
sigma.dirichlet <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j]))
var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist))
## [1] 2.796348 2.793421
With these in hand, we can calculate the total variation distance with pnorm:
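pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -
pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE)
# Note: the second term reuses sqrt(var.L.true); strictly it should be
# sqrt(var.L.dirichlet). The two are close here (2.778 vs 2.796).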
## [1] 0.8379193
# Previous result was 0.83737
We get three digits of agreement with the simulation.
I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

Hessian Matrix in Maximum Likelihood - Gauss vs. R

I am struggling with the following problem. In a nutshell: two different software packages (Gauss by Aptech and R) yield totally different Hessian matrices in a maximum likelihood procedure. I am using the same procedure (BFGS), the exact same data, the same likelihood formula (it is a very simple logit model) and the exact same starting values, and, confusingly, I get the same results for the parameters and the log-likelihood. Only the Hessian matrices differ across the two programs, and therefore the estimated standard errors and the statistical inference differ.
The deviation does not appear large in this specific example, but every added complication of the model increases the difference, so when I try to estimate my final model, the two programs yield completely different results.
Does anyone know how the two programs differ in the way they compute the Hessian, and possibly the right way to obtain the same results?
EDIT: In the R (Gauss) code, X (alt) is the independent variable, a two-column matrix whose first column is entirely ones and whose second column holds the subjects' responses. Vector y (itn) is the dependent variable, consisting of one column with the subjects' responses. The example (R code and data set) has been taken from http://www.polsci.ucsb.edu/faculty/glasgow/ps206/ps206.html, just as an example to reproduce and isolate the problem.
I have attached both codes (Gauss and R syntax) and outputs.
Any help would be greatly appreciated. Thank you :)
Gauss:
start={ 0.95568840 , -0.20459156 };
library maxlik,pgraph;
maxset;
_max_Algorithm = 2;
_max_Diagnostic = 1;
{betaa,f,g,cov,ret} = maxlik(XMAT,0,&ll,start);
call maxprt(betaa,f,g,cov,ret);
print _max_FinalHess;
proc ll(b,XMAT);
local exb, probo, logexb, yn, logexbn, yt, ynt, logl;
exb = EXP(alt*b);
//print exb;
probo = exb./(1+exb);
logexb = ln(probo);
yn = 1 - itn;
logexbn = ln(1 - probo);
yt = itn';
ynt = yn';
logl = (yt*logexb + ynt*logexbn);
print(logl);
retp(logl);
endp;
R:
startv <- c(0.95568840,-0.20459156)
logit.lf <- function(beta) {
exb <- exp(X%*%beta)
prob1 <- exb/(1+exb)
logexb <- log(prob1)
y0 <- 1 - y
logexb0 <- log(1 - prob1)
yt <- t(y)
y0t <- t(y0)
logl <- -(yt%*%logexb + y0t%*%logexb0)
return(logl)
}
logitmodel <- optim(startv, logit.lf, method="BFGS", control=list(trace=TRUE, REPORT=1), hessian=TRUE)
logitmodel$hessian
Gauss Output:
return code = 0
normal convergence
Mean log-likelihood -0.591820
Number of cases 1924
Covariance matrix of the parameters computed by the following method:
Inverse of computed Hessian
Parameters Estimates Std. err. Est./s.e. Prob. Gradient
------------------------------------------------------------------
P01 2.1038 0.2857 7.363 0.0000 0.0000
P02 -0.9984 0.2365 -4.221 0.0000 0.0000
Gauss Hessian:
0.20133256 0.23932571
0.23932571 0.29377761
R Output:
initial value 1153.210839
iter 2 value 1148.015749
iter 3 value 1141.420328
iter 4 value 1138.668174
iter 5 value 1138.662148
iter 5 value 1138.662137
iter 5 value 1138.662137
final value 1138.662137
converged
Coeff. Std. Err. z p value
[1,] 2.10379869 0.28570765 7.3634665 1.7919000e-13
[2,] -0.99837955 0.23651060 -4.2212889 2.4290942e-05
R Hessian:
[,1] [,2]
[1,] 387.34106 460.45379
[2,] 460.45379 565.24412
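They are just scaled differently. The GAUSS numbers are around 1924 times smaller than the R numbers, and 1924 is the number of cases: the GAUSS output reports the mean log-likelihood (-0.591820), while R's final value (1138.662137) is the summed negative log-likelihood.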
I think GAUSS keeps the numbers in a smaller range for numerical stability.
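A quick check of the scaling, as a sketch using only the numbers printed in the question (the variable names are mine). Dividing the R Hessian by the 1924 cases reproduces the GAUSS Hessian to about four decimals, and either Hessian leads to the same standard errors once the factor of 1924 is accounted for.

```r
n_cases <- 1924  # "Number of cases" in the GAUSS output

# Hessians as printed in the question
hess_r <- matrix(c(387.34106, 460.45379,
                   460.45379, 565.24412), nrow = 2, byrow = TRUE)
hess_gauss <- matrix(c(0.20133256, 0.23932571,
                       0.23932571, 0.29377761), nrow = 2, byrow = TRUE)

# R Hessian rescaled to the per-case (mean log-likelihood) scale
scaled <- hess_r / n_cases
max_abs_diff <- max(abs(scaled - hess_gauss))
max_abs_diff

# Standard errors from each Hessian
se_from_r     <- sqrt(diag(solve(hess_r)))
se_from_gauss <- sqrt(diag(solve(hess_gauss)) / n_cases)
rbind(se_from_r, se_from_gauss)  # compare with 0.2857 and 0.2365 in both outputs
```

So if you take the GAUSS Hessian and invert it without dividing by the number of cases, the standard errors will be off by a factor of sqrt(1924).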

generating random x and y coordinates with a minimum distance

Is there a way in R to generate random coordinates with a minimum distance between them?
E.g. what I'd like to avoid
x <- c(0,3.9,4.1,8)
y <- c(1,4.1,3.9,7)
plot(x~y)
This is a classical problem from stochastic geometry. Completely random points in space, where the numbers of points falling in disjoint regions are independent of each other, correspond to a homogeneous Poisson point process (in this case in R^2, but it could be in almost any space).
An important feature is that the total number of points has to be random before you can have independence of the counts of points in disjoint regions.
For the Poisson process points can be arbitrarily close together. If you define a process by sampling the Poisson process until you don't have any points that are too close together you have the so-called Gibbs Hardcore process. This has been studied a lot in the literature and there are different ways to simulate it. The R package spatstat has functions to do this. rHardcore is a perfect sampler, but if you want a high intensity of points and a big hard core distance it may not terminate in finite time... The distribution can be obtained as the limit of a Markov chain and rmh.default lets you run a Markov chain with a given Gibbs model as its invariant distribution. This finishes in finite time but only gives a realisation of an approximate distribution.
In rmh.default you can also simulate conditional on a fixed number of points. Note that when you sample in a finite box there is of course an upper limit to how many points you can fit with a given hard core radius, and the closer you are to this limit the more problematic it becomes to sample correctly from the distribution.
Example:
library(spatstat)
beta <- 100; R = 0.1
win <- square(1) # Unit square for simulation
X1 <- rHardcore(beta, R, W = win) # Exact sampling -- beware it may run forever for some par.!
plot(X1, main = paste("Exact sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X1) # Observed min. nearest neighbour dist.
#> [1] 0.102402
Approximate simulation
model <- rmhmodel(cif="hardcore", par = list(beta=beta, hc=R), w = win)
X2 <- rmh(model)
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X2, main = paste("Approx. sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X2) # Observed min. nearest neighbour dist.
#> [1] 0.1005433
Approximate simulation conditional on number of points
X3 <- rmh(model, control = rmhcontrol(p=1), start = list(n.start = 42))
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X3, main = paste("Approx. sim. given n =", 42))
minnndist(X3) # Observed min. nearest neighbour dist.
#> [1] 0.1018068
OK, how about this? You just generate random number pairs without restriction and then remove the ones which are too close. This could be a great start for that:
minimumDistancePairs <- function(x, y, minDistance){
i <- 1
repeat{
distance <- sqrt((x-x[i])^2 + (y-y[i])^2) < minDistance # pythagorean theorem
distance[i] <- FALSE # distance to oneself is always zero
if(any(distance)) { # if too close to any other point
x <- x[-i] # remove element from x
y <- y[-i] # and remove element from y
} else { # otherwise...
i = i + 1 # repeat the procedure with the next element
}
if (i > length(x)) break
}
data.frame(x,y)
}
minimumDistancePairs(
c(0,3.9,4.1,8)
, c(1,4.1,3.9,7)
, 1
)
will lead to
x y
1 0.0 1.0
2 4.1 3.9
3 8.0 7.0
Be aware, though, of the fact that these are not random numbers anymore (however you solve the problem).
You can use rejection sampling https://en.wikipedia.org/wiki/Rejection_sampling
The principle is simple: you resample until your data satisfy the condition.
> set.seed(1)
>
> x <- rnorm(2)
> y <- rnorm(2)
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 6.565578
> # resample while the squared distance is below the minimum (here 1)
> while((x[1]-x[2])^2+(y[1]-y[2])^2 < 1) {
+ x <- rnorm(2)
+ y <- rnorm(2)
+ }
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 6.565578
>
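The same resampling idea can be applied one point at a time when you need more than two points: propose a point, keep it only if it is far enough from every point kept so far. This is a different scheme from resampling the whole set (it is usually called simple sequential inhibition), and the sketch below is only an illustration: the function name, its arguments and the unit-square window are all my assumptions.

```r
# Hypothetical helper: name, arguments and the unit-square window are illustrative only
sequential_points <- function(n, min_dist, max_tries = 10000) {
  pts <- matrix(NA_real_, nrow = 0, ncol = 2)  # accepted points so far
  tries <- 0
  while (nrow(pts) < n && tries < max_tries) {
    tries <- tries + 1
    cand <- runif(2)  # candidate point in the unit square
    # squared distances from the candidate to all accepted points
    d2 <- (pts[, 1] - cand[1])^2 + (pts[, 2] - cand[2])^2
    # accept only if no accepted point is closer than min_dist
    if (all(d2 >= min_dist^2)) pts <- rbind(pts, cand)
  }
  pts  # may hold fewer than n rows if max_tries was exhausted
}

set.seed(42)
pts <- sequential_points(20, 0.1)
nrow(pts)
min(dist(pts))  # smallest pairwise distance among the accepted points
```

As with the other answers, max_tries is there so the loop cannot run forever when n points do not fit.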
The following is a naive hit-and-miss approach which for some choices of parameters (which were left unspecified in the question) works well. If performance becomes an issue, you could experiment with the package gpuR which has a GPU-accelerated distance matrix calculation.
rand.separated <- function(n,x0,x1,y0,y1,d,trials = 1000){
for(i in 1:trials){
nums <- cbind(runif(n,x0,x1),runif(n,y0,y1))
if(min(dist(nums)) >= d) return(nums)
}
return(NA) #no luck
}
This repeatedly draws samples of size n in [x0,x1]x[y0,y1] and throws a sample away if it doesn't satisfy the minimum-distance condition. As a safety measure, trials guards against an infinite loop. If solutions are hard to find or n is large you might need to increase or decrease trials.
For example:
> set.seed(2018)
> nums <- rand.separated(25,0,10,0,10,0.2)
> plot(nums)
runs almost instantly and produces a scatter plot of the 25 points.
I'm not sure what you are asking.
If you want random coordinates, here:
c(
runif(1,max=y[1],min=x[1]),
runif(1,max=y[2],min=x[2]),
runif(1,min=y[3],max=x[3]),
runif(1,min=y[4],max=x[4])
)

R : How can I minimize the function related to unknown parameter?

I want to write a function in R that minimizes an objective over an unknown parameter.
The exact equation is
Q_beta=min_{beta} sum_{i=1}^{i=n} || x_i - f(beta) ||^2
Here, ||.|| is the Euclidean norm and the sum runs over all n objects.
x_i is a vector, and f(beta) is a vector of the same dimension as x_i that depends on the unknown parameter beta.
I want to minimize the sum of all n squared Euclidean distances, and I also want to know which beta attains the minimum.
Can I write such a function in R? What kind of function do I need to use?
Thank you.
Does this work?
rm(list=ls())
lower <- -2 ## lower interval endpoint of possible betas
upper <- 2 ## upper interval endpoint
beta0 <- runif(1,lower,upper) ## true beta, randomly selected from interval
t <- seq(0,1,by=.01) ## grid of values that function is fit over
x <- beta0*t^2
## goal is to find beta0
f <- function(beta) beta*t^2
g <- function(beta) sum((x-f(beta))^2)
fit <- optimize(g,lower=lower,upper=upper)
## the following two should match
fit$minimum
beta0
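optimize only handles a single scalar parameter. If beta is a vector and there are several observed vectors x_i, optim does the same job. Here is a minimal sketch under assumed settings: the form of f, the true parameter values, the noise level and the number of observations are all made up for illustration.

```r
set.seed(1)
t_grid    <- seq(0, 1, by = 0.1)  # grid the function is evaluated on
beta_true <- c(1.5, -0.5)         # assumed true parameters

# Assumed model: f(beta) is a vector with the same length as each x_i
f <- function(beta) beta[1] * t_grid + beta[2] * t_grid^2

# 50 noisy observations x_i, one per row
X <- t(replicate(50, f(beta_true) + rnorm(length(t_grid), sd = 0.05)))

# Q(beta) = sum over i of ||x_i - f(beta)||^2
Q <- function(beta) sum(sweep(X, 2, f(beta))^2)

fit <- optim(c(0, 0), Q, method = "BFGS")
fit$par     # should be close to beta_true
fit$value   # minimized objective
```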

Generating Random Variables with given correlations between pairs of them:

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each are normal) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n,probs,corrs) {
tmp <- normalCopula( corrs, dim=4 , "un")
getSigma(tmp) ## test
x <- rCopula(n, tmp) ## use the n argument rather than a hard-coded 1000
x2 <- x
x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs,targetcorrs,probs,n=1000) {
cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
if (is(cc,"try-error")) return(NA)
sum((cc[lower.tri(cc)]-targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5),
method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).
