I am trying to use the mle2 command from the bbmle package. I am looking at p. 2 of "Maximum likelihood estimation and analysis with the bbmle package" by Bolker. Somehow I fail to enter the right start values. Here's the reproducible code:
l.lik.probit <- function(par, ivs, dv){
  Y <- as.matrix(dv)
  X <- as.matrix(ivs)
  K <- ncol(X)
  b <- as.matrix(par[1:K])
  phi <- pnorm(X %*% b)
  sum(Y * log(phi) + (1 - Y) * log(1 - phi))
}
n <- 200
set.seed(1000)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
latentz <- 1 + 2.0 * x1 + 3.0 * x2 + 5.0 * x3 + 8.0 * x4 + rnorm(n, 0, 5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >= 1] <- 1
x <- cbind(1, x1, x2, x3, x4)
values.start <- c(1, 1, 1, 1, 1)
foo2 <- mle2(l.lik.probit, start=list(dv=0, ivs=values.start), method="BFGS",
             optimizer="optim", data=list(Y=y, X=x))
And this is the error I get:
Error in mle2(l.lik.probit, start = list(Y = 0, X = values.start), method = "BFGS", :
some named arguments in 'start' are not arguments to the specified log-likelihood function
Any idea why? Thanks for your help!
You've missed a couple of things, but the most important is that by default mle2 takes a list of parameters; you can make it take a parameter vector instead, but you have to work a little bit harder.
I have tweaked the code slightly in places. (I changed the log-likelihood function to a negative log-likelihood function, without which this would never work!)
l.lik.probit <- function(par, ivs, dv){
  K <- ncol(ivs)
  b <- as.matrix(par[1:K])
  phi <- pnorm(ivs %*% b)
  -sum(dv * log(phi) + (1 - dv) * log(1 - phi))
}
n <- 200
set.seed(1000)
dat <- data.frame(x1=rnorm(n),
                  x2=rnorm(n),
                  x3=rnorm(n),
                  x4=rnorm(n))
beta <- c(1, 2, 3, 5, 8)
mm <- model.matrix(~x1+x2+x3+x4, data=dat)
latentz <- rnorm(n, mean=mm %*% beta, sd=5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >=1] <- 1
x <- mm
values.start <- rep(1,5)
Now we do the fit. The main thing is to specify vecpar=TRUE and to use parnames to let mle2 know the names of the elements in the parameter vector ...
library("bbmle")
names(values.start) <- parnames(l.lik.probit) <- paste0("b",0:4)
m1 <- mle2(l.lik.probit, start=values.start,
           vecpar=TRUE,
           method="BFGS", optimizer="optim",
           data=list(dv=y, ivs=x))
Note that for this particular example you have just re-implemented probit regression (although I understand that you now want to extend this to allow for heteroscedasticity in some way ...):
dat2 <- data.frame(dat, y)
m2 <- glm(y~x1+x2+x3+x4, family=binomial(link="probit"),
          data=dat2)
As a final note, I would say that you should check out the parameters argument, which allows you to specify a linear sub-model for any one of the parameters, and the formula interface:
m3 <- mle2(y~dbinom(prob=pnorm(eta), size=1),
           parameters=list(eta~x1+x2+x3+x4),
           start=list(eta=0),
           data=dat2)
PS confint(m1) appears to work fine (giving profile CIs as requested) with this set-up.
ae <- function(x,y) all.equal(unname(coef(x)),unname(coef(y)),tol=5e-5)
ae(m1,m2) && ae(m2,m3)
I'm trying to get the parameters w, lambda_1, lambda_2 and a of a mixture bi-exponential model, using a log-likelihood function and the optim function in R. The model density is a two-component mixture of shifted exponentials:
f(y) = w * dexp(y - a, rate = 1/lambda_1) + (1 - w) * dexp(y - a, rate = 1/lambda_2)
Here is the code:
biexpLL <- function(theta, y) {
  # define parameters
  w <- theta[1]
  lambda_1 <- theta[2]
  a <- theta[3]
  lambda_2 <- theta[4]
  # likelihood function with dexp
  l <- w * dexp((y - a), rate = 1/lambda_1) + (1 - w) * dexp((y - a), rate = 1/lambda_2)
  -sum(log(l))
}
# Generate some fake data
w <- 0.7
n <- 500
lambda_1 <- 2
lambda_2 <- 0.2
set.seed(45)
biexp_data <- (w * rexp(n, 1/lambda_1) + (1 - w) * rexp(n, 1/lambda_2))
# Optimization
optim(par = c(0.5, 0.1, 0.001, 0.2),
      fn = biexpLL,
      y = biexp_data)
#$par
#[1] -94789220.4 16582.9 -333331.7 134744336.2
The estimated parameters are very different from the ones used to generate the fake data! What am I doing wrong?
The original code is prone to warnings and errors because the parameters can easily wander into invalid values. For example, we need w in [0, 1] and lambda > 0. Also, if a is larger than a data point, the density for that point becomes zero, hence an infinite negative log-likelihood.
The code below uses some tricks to handle these cases.
w is mapped into the range [0, 1] by the logistic function.
The lambda values are kept positive by the exponential function.
A tiny value is added to the likelihood to handle cases of zero likelihood.
Also, the data generation process has been changed so that samples are generated from one of the exponential distributions with the given probability w.
Finally, the sample size has been increased since the result was not stable with n=500.
biexpLL <- function(theta, y) {
  # define parameters, transformed to their valid ranges
  w <- 1/(1 + exp(-theta[1]))  # logistic keeps w in (0, 1)
  lambda_1 <- exp(theta[2])    # exp keeps lambda_1 > 0
  a <- theta[3]
  lambda_2 <- exp(theta[4])    # exp keeps lambda_2 > 0
  # likelihood function with dexp; the small constant guards against log(0)
  l <- w * dexp((y - a), rate = 1/lambda_1) + (1 - w) * dexp((y - a), rate = 1/lambda_2)
  -sum(log(l + 1e-9))
}
# Generate some fake data
w <- 0.7
n <- 5000
lambda_1 <- 2
lambda_2 <- 0.2
set.seed(45)
n1 <- round(n*w)
n2 <- n - n1
biexp_data <- c(rexp(n1, rate=1/lambda_1),
                rexp(n2, rate=1/lambda_2))
# Optimization
o <- optim(par=c(0.5, 0.1, 0.001, 0.2),
           fn=biexpLL,
           y=biexp_data)
# back-transform the estimates to the original scale
1/(1+exp(-o$par[1]))  # w
exp(o$par[2])         # lambda_1
o$par[3]              # a
exp(o$par[4])         # lambda_2
In my environment I obtained the results below.
They seem reasonably close to the simulation parameters; note that the two lambda values are swapped, so the estimated w corresponds to 1 - 0.7 rather than 0.7.
> 1/(1+exp(-o$par[1]))
[1] 0.3458264
> exp(o$par[2])
[1] 0.1877655
> o$par[3]
[1] 3.738172e-05
> exp(o$par[4])
[1] 2.231844
Note that for mixture models of this kind, people often use the EM algorithm to optimize the likelihood instead of direct numerical optimization as done here. You may want to have a look at it as well; a rough sketch follows.
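As a hedged illustration (my own sketch, not part of the original answer): a minimal EM loop for a two-component exponential mixture. It ignores the shift parameter a and parameterizes each component by its mean, matching dexp(rate = 1/lambda) above.
em_biexp <- function(y, w = 0.5, lambda_1 = 1, lambda_2 = 5, n_iter = 200) {
  for (iter in seq_len(n_iter)) {
    # E-step: posterior probability that each observation belongs to component 1
    d1 <- w * dexp(y, rate = 1/lambda_1)
    d2 <- (1 - w) * dexp(y, rate = 1/lambda_2)
    r <- d1 / (d1 + d2)
    # M-step: weighted maximum-likelihood updates of the weight and the component means
    w <- mean(r)
    lambda_1 <- sum(r * y) / sum(r)
    lambda_2 <- sum((1 - r) * y) / sum(1 - r)
  }
  c(w = w, lambda_1 = lambda_1, lambda_2 = lambda_2)
}
em_biexp(biexp_data)  # reuses the simulated data from above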
I have been able to get the parameters with the R package DEoptim:
library(DEoptim)

biexpLL <- function(theta, y) {
  w <- theta[1]
  lambda_1 <- theta[2]
  lambda_2 <- theta[3]
  l <- w * dexp(y, rate = 1 / lambda_1) + (1 - w) * dexp(y, rate = 1 / lambda_2)
  log_Lik <- -sum(log(l))
  if (is.infinite(log_Lik)) {
    return(10 ^ 30)
  } else {
    return(log_Lik)
  }
}
w <- 0.7
n <- 500
lambda_1 <- 2
lambda_2 <- 0.2
set.seed(45)
indicator <- rbinom(n = 500, size = 1, prob = w)
biexp_data <- (indicator * rexp(n, 1 / lambda_1) + (1 - indicator) * rexp(n, 1 / lambda_2))
obj_DEoptim <- DEoptim(fn = biexpLL, lower = c(0, 0, 0), upper = c(1, 1000, 1000),
                       control = list(itermax = 1000, parallelType = 1), y = biexp_data)
obj_DEoptim$optim$bestmem
par1 par2 par3
0.7079678 2.2906098 0.2026040
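One reason this works without the logistic/exponential reparameterization used in the previous answer is that DEoptim performs a bounded global search: the candidate parameters never leave the box defined by lower and upper, so invalid values of w and the lambdas are never evaluated.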
Why are prediction_me and prediction_R not equal? I'm attempting to follow the formula given by Lemma 5 here. Does the predict function use a different formula, have I made a mistake in my computation somewhere, or is it just rounding error? (the two are pretty close)
set.seed(100)
# generate data
x <- rnorm(100, 10)
y <- 3 + x + rnorm(100, 5)
data <- data.frame(x = x, y = y)
# fit model
mod <- lm(y ~ x, data = data)
# new observation
data2 <- data.frame(x = rnorm(5, 10))
# prediction for new observation
d <- as.matrix(cbind(1, data[,-2]))
d2 <- as.matrix(cbind(1, data2))
fit <- d2 %*% mod$coefficients
t <- qt(1 - .025, mod$df.residual)
s <- summary(mod)$sigma
half <- as.vector(t*s*sqrt(1 + d2%*%solve(t(d)%*%d, t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
prediction_me
prediction_R
Your current code is almost fine. Just note that the formula in Lemma 5 is for a single newly observed x. For this reason, half contains not only the relevant variances but also covariances, while you only need the former. Thus, as.vector should be replaced with diag:
half <- diag(t * s * sqrt(1 + d2 %*% solve(t(d) %*%d , t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
range(prediction_me - prediction_R)
# [1] 0 0
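For intuition, here is an equivalent per-observation computation (a sketch of my own, reusing d, d2, t and s from the question) that only ever forms the diagonal entries rather than the full matrix:
# for each new row x0 the half-width is t * s * sqrt(1 + x0' (X'X)^{-1} x0);
# rowSums() picks out exactly the diagonal of d2 %*% solve(crossprod(d)) %*% t(d2)
half_alt <- t * s * sqrt(1 + rowSums((d2 %*% solve(crossprod(d))) * d2))
all.equal(as.numeric(half_alt), as.numeric(half))  # should be TRUE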
I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two constraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z, p=1, order.max=ceiling(mean(step[,1])), aic=FALSE)
# check whether the coefficients sum to 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I do not see how to apply it here.
**UPDATE**
I think I managed to solve my question as follows.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
  y <- y + coeff[i] * c(z[(i+1):length(z)], rep(0, i))
  ifelse(i == 1,
         X <- c(z[2:length(z)], 0),
         X <- cbind(X, c(z[(i+1):length(z)], rep(0, i))))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat, t(y) %*% X, matrix(1, nrow=length(coeff), ncol=1), 1, meq=1)
s$solution
# The coefficients should sum up to 1
sum(s$solution)
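For comparison, here is a more compact sketch of the same constrained least-squares idea (my own illustration, not from the original post): it reuses z and the order chosen above, builds the lagged design matrix with embed() instead of zero-padding, and, like the code above, has no intercept term a0.
p <- length(arfit$ar)
E <- embed(z, p + 1)                   # column 1 is x(t), columns 2..(p+1) are x(t-1)..x(t-p)
y_t <- E[, 1]
X_lag <- E[, -1, drop = FALSE]
Dmat <- crossprod(X_lag)               # X'X
dvec <- as.vector(crossprod(X_lag, y_t))  # X'y
Amat <- matrix(1, nrow = p, ncol = 1)  # single equality constraint: sum of coefficients = 1
fit_qp <- solve.QP(Dmat, dvec, Amat, bvec = 1, meq = 1)
sum(fit_qp$solution)                   # 1, up to numerical precision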
I am trying to find a way to estimate the predicted values of y with confidence intervals for specific values of x in an OLS regression. My model includes an interaction term and I use clustered standard errors and weights in my model.
A similar question was asked and answered previously; I thought it could be a good starting point:
robust standard errors in ggplot2
The problem is that the solution offered there does not work when there are interaction terms OR weights in the model. It does produce an outcome when there are both weights and interaction terms. I found this confusing, but I am relatively new to R and could not identify the source of the problem.
In the second and third examples (lm2 & lm3) I get "Error in X %*% V : non-conformable arguments". My best guess for the third case is that model.frame(lm3) does not include the interaction term, but I don't know whether I am on the right track and could not find a way to fix it. Besides, it's not clear to me how I can set x1 to a specific value in this example. Can someone help me revise the code below, or offer an alternative way to get the predicted standard errors when x1 is set to a specific value?
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), w1 = runif(100, 0.1, 2),
                 y = rnorm(100), group = as.factor(sample(1:10, 100, replace = TRUE)))
lm1 <- lm(y ~ x1+x2, data = df)
lm2 <- lm(y ~ x1+x2, data = df, weight=w1)
lm3 <- lm(y ~ x1*x2, data = df)
lm4 <- lm(y ~ x1*x2, data = df, weight=w1)
getvcov <- function(fm, dfcw, cluster) {
  library(sandwich); library(lmtest)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- fm$rank
  dfc <- (M/(M-1)) * ((N-1)/(N-K))
  uj <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum))
  dfc * sandwich(fm, meat=crossprod(uj)/N) * dfcw
}
V <- getvcov(lm1, 1, df$group)
X <- as.matrix(model.frame(lm1))
se <- predict(lm1, se=TRUE)$se.fit
se_robust1 <- sqrt(diag(X %*% V %*% t(X)))

V <- getvcov(lm2, 1, df$group)
X <- as.matrix(model.frame(lm2))
se <- predict(lm2, se=TRUE)$se.fit
se_robust2 <- sqrt(diag(X %*% V %*% t(X)))

V <- getvcov(lm3, 1, df$group)
X <- as.matrix(model.frame(lm3))
se <- predict(lm3, se=TRUE)$se.fit
se_robust3 <- sqrt(diag(X %*% V %*% t(X)))

V <- getvcov(lm4, 1, df$group)
X <- as.matrix(model.frame(lm4))
se <- predict(lm4, se=TRUE)$se.fit
se_robust4 <- sqrt(diag(X %*% V %*% t(X)))
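One likely source of the error, offered as a sketch rather than a definitive answer: model.frame() returns the raw variables (the response, the predictors and, for lm2 and lm4, the weights) rather than the design matrix, so its column count matches the robust covariance V only by coincidence. Using model.matrix() keeps the dimensions consistent, and also makes it easy to evaluate the standard errors with x1 held at a chosen value (the grid below is an illustrative choice):
V <- getvcov(lm3, 1, df$group)
X <- model.matrix(lm3)  # columns: intercept, x1, x2, x1:x2 (conformable with V)
se_robust3 <- sqrt(diag(X %*% V %*% t(X)))

# standard errors with x1 fixed at 1, over an illustrative grid of x2 values
newdat <- data.frame(x1 = 1, x2 = seq(-2, 2, by = 0.5))
newX <- model.matrix(delete.response(terms(lm3)), newdat)
se_at_x1 <- sqrt(diag(newX %*% V %*% t(newX)))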
I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model should produce coefficients like this:
Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
  stopifnot(length(beta) == 3)
  df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
  var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
  stopifnot(var.epsilon > 0)
  df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
  df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
  return(df)
}

get.R.sq <- function(desired) {
  model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
  return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where the var[Xs]=1 by assumption. You can then solve for var[eps] as a function of R-squared.
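Spelled out with the question's coefficients 3 and -2 and a target R-squared of 0.2 (a quick check of the var.epsilon line above):
# R.sq = (b2^2 + b3^2) / (b2^2 + b3^2 + var.eps)  =>  var.eps = (b2^2 + b3^2) * (1 - R.sq) / R.sq
b2 <- 3; b3 <- -2; R.sq <- 0.2
var.eps <- (b2^2 + b3^2) * (1 - R.sq) / R.sq
var.eps        # 52
sqrt(var.eps)  # about 7.2, the sd.value the question asks for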
So the formula for R^2 is 1 - var(residual)/var(total).
In this case, the variance of Y is going to be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute rsquared explicitly with this function:
rsq <- function(x){ 1 - x^2/(9 + 4 + x^2) }
With a little algebra (R^2 = 13/(13 + x^2), so x = sqrt(13) * sqrt((1 - R^2)/R^2)), you can compute the inverse of this function:
rsqi <- function(x){ sqrt(13) * sqrt((1 - x)/x) }
So setting sd.value<-rsqi(rsquared) should give you what you want.
We can test this as follows:
simrsq <- function(x){
  # uses n, X1 and X2 from the question's code above
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim
  results <- 1:nsim
  for (i in 1:nsim) {
    # track iterations: if we never hit the target r-squared, abort after 3000 steps
    iter <- 0
    while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
  # takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary, since this affects the sought sd.value.
The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then set the tolerance to 0.01.
Note that too tight a tolerance might not allow you to find a result.
The starting value of 1 is quite a bad one, making this iterative algorithm quite slow.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55
This takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100 (taking 150 seconds), with a step increase of 0.001 and the tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Here is some other code to generate a multiple linear regression with normally distributed errors:
Oops, sorry: this code just produces a multiple regression; it does not target a specific R-squared.
sim.regression <- function(n.obs=10, coefficients=runif(10, -5, 5), s.deviation=.1){
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var){
    M[,i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x=M, y=y, coeff=coefficients))
}
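Note that, unlike the model in the question, this function has no intercept column: every coefficient is the slope of a simulated predictor. A small usage sketch (my own example) recovering the coefficients:
set.seed(1)
sim <- sim.regression(n.obs = 500, coefficients = c(5, 3, -2), s.deviation = 1)
coef(lm(drop(sim$y) ~ sim$x))  # intercept near 0, slopes near 5, 3 and -2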