Maximum likelihood estimation of the log-normal distribution using R - r

I'm trying to estimate a linear model with a log-normal distributed error term. I already have working code for a linear model with normally distributed errors:
library(Ecdat)
library(assertthat)
library(maxLik)
# Load the data
data(Wages1)
# Check what R says
summary(lm(wage ~ school + exper + sex, data = Wages1))
# Use maxLik from package maxLik
# The likelihood function
my_log_lik_pos <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(my_data), 1))
return(sum(log(dnorm(y, mean = xb, sd = sigma))))
}
# Bind the data
my_data <- cbind(Wages1$wage, 1, Wages1$school, Wages1$exper, Wages1$sex)
my_problem <- maxLik(my_log_lik_pos, data = my_data,
start = rep(1,5), method = "BFGS")
summary(my_problem)
I get approximately the same results. Now I try to do the same, but using the log-normal likelihood. For this, I have to first simulate some data:
true_beta <- c(0.1, 0.2, 0.3, 0.4, 0.5)
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
my_data_2 <- cbind(ys, my_data[, -1])
And the log-likelihood function:
my_log_lik_lognorm <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(data), 1))
return(sum(log(dlnorm(y, mean = xb, sd = sigma))))
}
my_problem2 <- maxLik(my_log_lik_lognorm, data = my_data_2,
start = rep(0.2,5), method = "BFGS")
summary(my_problem2)
The estimated parameters should be around the values of true_beta, but for some reason I find completely different values. I tried with different methods, different starting values but to no avail. I'm sure that I'm missing something obvious, but I don't see what.
Am I right to assume that the log-likelihood of the log-normal distribution is:
sum(log(dlnorm(y, mean = .., sd = ...))
Unless I'm mistaken, this is the definition of the log-likelihood (sum of the logs of the densities).

I found the issue: it seems the problem is not my log-likelihood function. When I try to estimate the model with glm:
summary(glm(ys ~ school + exper + sex, family=gaussian(link="log"), data=Wages1))
I get the same result as with maxLik and my log-likelihood. It would seem the problem comes from when I tried to simulate some data:
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
The correct way to simulate the data:
ys <- rlnorm(nrow(my_data), my_data[, -1] %*% head(true_beta, -1), tail(true_beta, 1))
Now everything works!

Related

How do I calculate cronbach's alpha on multiply imputed data?

I have run a multiple imputation (m=45, 10 iterations) using the MICE package, and want to calculate the cronbach's alpha for a number of ordinal scales in the data. Is there a function in r that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no available formula with which we can analytically estimate the standard error of alpha, we also need to deploy a bootstrapping procedure to estimate this standard error.
The function cronbach_fun() takes the following arguments:
list_compl_data: a character string specifying the list of completed data from a mids object.
boot: a logical indicating whether a non-parametrical bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
n <- nrow(list_compl_data); p <- ncol(list_compl_data)
total_variance <- var(rowSums(list_compl_data))
item_variance <- sum(apply(list_compl_data, 2, sd)^2)
alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
out <- list(alpha = alpha)
boot_alpha <- numeric(B)
if (boot) {
for (i in seq_len(B)) {
boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
total_variance <- var(rowSums(boot_dat))
item_variance <- sum(apply(boot_dat, 2, sd)^2)
boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
}
out$var <- var(boot_alpha)
}
if (ci){
out$ci <- quantile(boot_alpha, c(.025,.975))
}
return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913

Running rJAGS when the likelihood is a custom density

I am trying to figure out how to sample from a custom density in rJAGS but am running into issues. having searched the site, I saw that there is a zeroes (or ones) trick that can be employed based on BUGS code but am having a hard time with its implementation in rJAGS. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <-"
model {
for (i in 1:N){
x[i] ~ dpois(lambda)
lambda <- -N*log(1-exp(-(1/mu)))
}
mu ~ dlnorm(mup, taup)
mup <- log(.0001)
taup <- 1/49
R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec<-textConnection(model1.string)
jags <- jags.model(model1.spec,
data = list('x' = x,
'N' = N),
n.chains=4,
n.adapt=100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
Using the zeros trick, the variable on the left-hand side of the dpois() relationship has to be an N-length vector of zeros. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <-"
model {
for (i in 1:N){
lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
loglam[i] <- log(lambda[i]) + C
z[i] ~ dpois(loglam[i])
}
mu ~ dnorm(0,.1)
tau ~ dgamma(1,.1)
sig2 <- pow(tau, -1)
sumLL <- sum(log(lambda[]))
}
"
model1.spec<-textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
data = list('x' = x,
'z' = z,
'N' = N,
'C' = C,
'pi' = pi),
inits = function()list(tau = 1, mu = 3),
n.chains=4,
n.adapt=100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 4.493 2.1566 0.034100 0.1821
sig2 1.490 0.5635 0.008909 0.1144
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779

Fitting probit model inr R

For my thesis I have to fit some glm models with MLEs that R doesn't have, I was going ok for the models with close form but now I have to use de Gausian CDF, so i decide to fit a simple probit model.
this is the code:
Data:
set.seed(123)
x <-matrix( rnorm(50,2,4),50,1)
m <- matrix(runif(50,2,4),50,1)
t <- matrix(rpois(50,0.5),50,1)
z <- (1+exp(-((x-mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y,x,m,t))
myprobit <- function (formula, data)
{
mf <- model.frame(formula, data)
y <- model.response(mf, "numeric")
X <- model.matrix(formula, data = data)
if (any(is.na(cbind(y, X))))
stop("Some data are missing.")
loglik <- function(betas, X, y, sigma) { #loglikelihood
p <- length(betas)
beta <- betas[-p]
eta <- X %*% beta
sigma <- 1 #because of identification, sigma must be equal to 1
G <- pnorm(y, mean = eta,sd=sigma)
sum( y*log(G) + (1-y)*log(1-G))
}
ls.reg <- lm(y ~ X - 1)#starting values using ols, indicating that this model already has a constant
start <- coef(ls.reg)
fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1), method = "BFGS", hessian = TRUE) #optimizar
if (fit$convergence > 0) {
print(fit)
stop("optim failed to converge!") #verify convergence
}
return(fit)
}
myprobit(y ~ x + m + t,data = data1)
And i get: Error in X %*% beta : non-conformable arguments, if i change start <- coef(ls.reg) with start <- c(coef(ls.reg), 1) i get wrong stimatives comparing with:
probit <- glm(y ~ x + m + t,data = data1 , family = binomial(link = "probit"))
What am I doing wrong?
Is possible to correctly fit this model using pnorm, if no, what algorithm should I use to approximate de gausian CDF. Thanks!!
The line of code responsible for your error is the following:
eta <- X %*% beta
Note that "%*%" is the matrix multiplication operator. By reproducing your code I noticed that X is a matrix with 50 rows and 4 columns. Hence, for matrix multiplication to be possible your "beta" needs to have 4 rows. But when you run "betas[-p]" you subset the betas vector by removing its last element, leaving only three elements instead of the four you need for matrix multiplication to be defined. If you remove [-p] the code will work.

Predict function for lm object in R

Why are prediction_me and prediction_R not equal? I'm attempting to follow the formula given by Lemma 5 here. Does the predict function use a different formula, have I made a mistake in my computation somewhere, or is it just rounding error? (the two are pretty close)
set.seed(100)
# genrate data
x <- rnorm(100, 10)
y <- 3 + x + rnorm(100, 5)
data <- data.frame(x = x, y = y)
# fit model
mod <- lm(y ~ x, data = data)
# new observation
data2 <- data.frame(x = rnorm(5, 10))
# prediction for new observation
d <- as.matrix(cbind(1, data[,-2]))
d2 <- as.matrix(cbind(1, data2))
fit <- d2 %*% mod$coefficients
t <- qt(1 - .025, mod$df.residual)
s <- summary(mod)$sigma
half <- as.vector(t*s*sqrt(1 + d2%*%solve(t(d)%*%d, t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
prediction_me
prediction_R
Your current code is almost fine. Just note that the formula in Lemma 5 is for a single newly observed x. For this reason, half contains not only relevant variances but also covariances, while you only need the former ones. Thus, as.vector should be replaced with diag:
half <- diag(t * s * sqrt(1 + d2 %*% solve(t(d) %*%d , t(d2))))
prediction_me <- cbind(fit, fit - half, fit + half)
prediction_R <- predict(mod, newdata = data2, interval = 'prediction')
range(prediction_me - prediction_R)
# [1] 0 0

How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation?

I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances works well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficients. Does anyone know how can I do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation as ARIMA models. Although I am able to estimate the AR(1) coef. and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter codes from EViews to R, so I need to learn SSMcustom for other models that are more complicated.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state space models, you need to provide your own model updating function. The default behaviour can only handle NA's in covariance matrices H and Q. The main goal of fitSSM is just to get started with simple stuff. For complex models and/or large data, I would recommend using your self-written objective function (with help of logLik method) and your favourite numerical optimization routines manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
if (estimate) {
-logLik(model)
} else {
model
}
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)

Resources