I have run a multiple imputation (m=45, 10 iterations) using the mice package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could help me calculate the alpha coefficient across the imputed datasets in a way that satisfies Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function that computes Cronbach's alpha and also returns an estimate of its standard error, which can be used in a call to pool.scalar() later on. Since there is no readily available formula with which we can analytically estimate the standard error of alpha, we use a bootstrap procedure to estimate it.
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame holding one completed data set from a mids object (item columns only).
boot: a logical indicating whether a nonparametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a percentile confidence interval around alpha should be estimated (requires boot = TRUE).
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
  n <- nrow(list_compl_data); p <- ncol(list_compl_data)
  # point estimate of alpha from item variances and total-score variance
  total_variance <- var(rowSums(list_compl_data))
  item_variance <- sum(apply(list_compl_data, 2, sd)^2)
  alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
  out <- list(alpha = alpha)
  boot_alpha <- numeric(B)
  if (boot) {
    # resample rows with replacement and recompute alpha B times
    for (i in seq_len(B)) {
      boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
      total_variance <- var(rowSums(boot_dat))
      item_variance <- sum(apply(boot_dat, 2, sd)^2)
      boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
    }
    out$var <- var(boot_alpha)
  }
  # the percentile interval is only meaningful when bootstrap draws exist
  if (boot && ci) {
    out$ci <- quantile(boot_alpha, c(.025,.975))
  }
  return(out)
}
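As a quick sanity check of the point estimate computed inside cronbach_fun(), the same formula can be written as a stand-alone helper. This is only an illustrative sketch: the name cronbach_alpha and the toy data are assumptions, not part of the answer above.

```r
# Cronbach's alpha:
# (p / (p - 1)) * (1 - sum of item variances / variance of the total score)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  p <- ncol(items)
  item_variance  <- sum(apply(items, 2, var))
  total_variance <- var(rowSums(items))
  (p / (p - 1)) * (1 - item_variance / total_variance)
}

# toy data (hypothetical): three identical items are perfectly consistent
v <- c(1, 2, 2, 3, 2, 2, 3, 3, 2, 3)
identical_items <- data.frame(x1 = v, x2 = v, x3 = v)
alpha_identical <- cronbach_alpha(identical_items)
```

With identical items the total-score variance is p^2 times the item variance, so the formula should return an alpha of 1.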
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
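To make explicit what the pooling step does with Q and U, Rubin's rules for a scalar estimate can be written out in a few lines of base R. This is a sketch for illustration only: the function name rubin_pool and the demo values are made up, and the degrees of freedom that pool.scalar() also returns are omitted.

```r
# Rubin's rules for a scalar estimate.
# Q: point estimates from the m completed data sets
# U: their within-imputation variances (here: bootstrap variances)
rubin_pool <- function(Q, U) {
  m    <- length(Q)
  qbar <- mean(Q)                 # pooled point estimate
  ubar <- mean(U)                 # average within-imputation variance
  b    <- var(Q)                  # between-imputation variance
  t    <- ubar + (1 + 1 / m) * b  # total variance of the pooled estimate
  list(m = m, qbar = qbar, ubar = ubar, b = b, t = t)
}

# made-up values, for illustration only
Q_demo <- c(0.78, 0.80, 0.76, 0.79)
U_demo <- c(0.010, 0.012, 0.011, 0.009)
pooled_demo <- rubin_pool(Q_demo, U_demo)
```

The square root of t is the pooled standard error that enters the confidence interval in pool_estimates().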
I am trying to figure out how to sample from a custom density in rjags but am running into issues. Having searched the site, I saw that there is a zeros (or ones) trick that can be employed, based on BUGS code, but I am having a hard time implementing it in rjags. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <-"
model {
for (i in 1:N){
x[i] ~ dpois(lambda)
lambda <- -N*log(1-exp(-(1/mu)))
}
mu ~ dlnorm(mup, taup)
mup <- log(.0001)
taup <- 1/49
R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec<-textConnection(model1.string)
jags <- jags.model(model1.spec,
data = list('x' = x,
'N' = N),
n.chains=4,
n.adapt=100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
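With the zeros trick, the variable on the left-hand side of the dpois() relation has to be an N-length vector of zeros, and the Poisson mean for observation i is its negative log-likelihood contribution plus a constant C chosen large enough to keep the mean positive. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.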
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <-"
model {
for (i in 1:N){
lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
# zeros trick: the Poisson mean is the negative log-likelihood plus C
loglam[i] <- -log(lambda[i]) + C
z[i] ~ dpois(loglam[i])
}
mu ~ dnorm(0,.1)
tau ~ dgamma(1,.1)
sig2 <- pow(tau, -1)
sumLL <- sum(log(lambda[]))
}
"
model1.spec<-textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
data = list('x' = x,
'z' = z,
'N' = N,
'C' = C,
'pi' = pi),
inits = function()list(tau = 1, mu = 3),
n.chains=4,
n.adapt=100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 4.493 2.1566 0.034100 0.1821
sig2 1.490 0.5635 0.008909 0.1144
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779
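The identity behind the zeros trick can be checked outside JAGS. The sketch below uses made-up values (x_demo, mu_demo, C_demo are hypothetical) and confirms that a zero drawn from a Poisson with mean -log(L) + C contributes a factor proportional to the likelihood L.

```r
# Zeros trick identity: if z = 0 and z ~ Poisson(phi) with phi = -log(L) + C,
# then P(z = 0) = exp(-phi) = L * exp(-C), i.e. proportional to L.
x_demo  <- c(2.1, 3.4, 2.9)   # hypothetical observations
mu_demo <- 3                  # hypothetical mean
C_demo  <- 10                 # constant that keeps phi positive

L_demo   <- dnorm(x_demo, mean = mu_demo, sd = 1)  # likelihood contributions
phi_demo <- -log(L_demo) + C_demo                  # Poisson means used by the trick
p_zero   <- dpois(0, phi_demo)                     # probability of observing a zero

ratio <- p_zero / L_demo   # the same constant, exp(-C), for every observation
```

Because the ratio does not depend on the parameters, the constant exp(-C) drops out of the posterior.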
I would like to perform 10-fold cross-validation manually on the prostate data in order to learn how it works. I adapted code from the elasticnet package and estimated the parameters with the glmnet package (which can of course perform cross-validation itself, but I would like to do it manually). After the analysis, it seems to me that I need a criterion other than the minimum of cv.error to choose the tuning parameter, because that gives an almost null model. If not, where is my mistake? (According to the original paper by Tibshirani, the optimal model has three variables.)
Here is the code:
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8],T,T)
y <- scale(prostate[,9],T,F)
lambda = seq(0,1,0.02)
cv.folds <- function(n, folds = 10){
split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE){
n <- nrow(x)
all.folds <- cv.folds(length(y), K)
residmat <- matrix(0, length(lambda), K)
for(i in seq(K)) {
omit <- all.folds[[i]]
xk <- as.matrix(x[-omit, ])
yk <- as.vector(y[-omit])
xg <- x[omit, ]
yg <- y[omit]
fit <- glmnet(xk, yk, family="gaussian",
alpha=1, lambda=lambda,standardize = FALSE, intercept = FALSE)
fit <- predict(fit,newx=xg,lambda=lambda)
if(length(omit)==1){fit<-matrix(fit,nrow=1)}
residmat[, i] <- apply((yg - fit)^2, 2, mean)
}
cv <- apply(residmat, 1, mean)
cv.error <- sqrt(apply(residmat, 1, var)/K)
object<-list(lambda = lambda, cv = cv, cv.error = cv.error)
if(plot.it) {
plot(lambda, cv, type = "b", xlab="lambda", ylim = range(cv, cv + cv.error, cv - cv.error))
invisible(object)
}
}
result <- c.val(x,y,K = 10,lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family="gaussian",
alpha=1, lambda=lambda.opt,standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.
Your "mistake" is very hard to spot: glmnet does not keep the order of the lambda vector you supply. It sorts the values in decreasing order, so the fitted models (and hence the columns of the predictions) are indexed by decreasing lambda.
Example with the data you used:
res <- glmnet(x, y, lambda=lambda)
res$lambda
So when you call the command lambda[which.min(result$cv.error)] at the end of your procedure, you will not get the value corresponding to the minimum of the cross-validated error. Also, it explains why your graph looks strange.
An easy fix would be to declare lambda at the beginning of the script as a decreasing vector:
lambda = seq(1, 0, 0.02)
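The index mismatch can be reproduced in base R without fitting anything. In this sketch the CV curve cv_demo is hypothetical and the decreasing sort only mimics the order in which glmnet stores its path.

```r
# glmnet stores its fits by decreasing lambda; mimic that ordering here
lambda     <- seq(0, 1, 0.02)                  # increasing, as in the question
lambda_fit <- sort(lambda, decreasing = TRUE)  # order the results refer to

# hypothetical CV curve, indexed like the fitted path, minimal at lambda = 0.1
cv_demo <- (lambda_fit - 0.1)^2

wrong <- lambda[which.min(cv_demo)]      # looks up the increasing vector
right <- lambda_fit[which.min(cv_demo)]  # looks up the vector used by the fit
```

Here the wrong lookup returns the mirror image of the intended value (0.9 instead of 0.1), which is the kind of shift that selects a nearly empty model.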
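Final remark: be careful when fitting glmnet with a single lambda value. The glmnet documentation advises against it; fit the whole path and extract the coefficients at the chosen value instead, e.g. with coef(fit, s = lambda.opt).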
Some background: the nlm function in R is a general-purpose optimization routine that uses a Newton-type algorithm. To optimize a function, Newton's method requires the function as well as its first and second derivatives (the gradient vector and the Hessian matrix, respectively). In R, the nlm function lets you supply the gradient and Hessian as attributes of the value returned by the objective function; if they are left unspecified, nlm approximates them numerically by finite differences. More accurate solutions can be found by supplying the analytic gradient and Hessian, so it's a useful feature.
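As a minimal illustration of that interface, the sketch below attaches an analytic gradient and Hessian to a simple quadratic objective. The names quad_obj and target are made up for this example, and the function is deliberately much simpler than the logistic likelihood discussed next.

```r
# objective: sum((x - a)^2), with analytic derivatives attached as attributes
quad_obj <- function(x, a) {
  value <- sum((x - a)^2)
  attr(value, "gradient") <- 2 * (x - a)          # first derivatives
  attr(value, "hessian")  <- 2 * diag(length(x))  # second derivatives
  value
}

target   <- c(3, 5)   # hypothetical minimiser
quad_fit <- nlm(quad_obj, p = c(10, 10), a = target)
quad_fit$estimate
```

Extra arguments such as a are passed through nlm's ... to the objective, which is also how y and x reach logistic_ll in the code below.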
My problem: the nlm function is slower and often fails to converge in a reasonable amount of time when the analytic Hessian is supplied. I'm guessing this is some sort of bug in the underlying code, but I'd be happy to be wrong. Is there a way to make nlm work better with an analytic Hessian matrix?
Example: my R code below demonstrates this problem using a logistic regression example, where
log(Pr(Y=1)/Pr(Y=0)) = b0 + Xb
where X is a multivariate normal of dimension N by p and b is a vector of coefficients of length p.
library(mvtnorm)
# example demonstrating a problem with NLM
expit <- function(mu) {1/(1+exp(-mu))}
mk.logit.data <- function(N,p){
set.seed(1232)
U = matrix(runif(p*p), nrow=p, ncol=p)
S = 0.5*(U+t(U)) + p*diag(rep(1,p))
X = rmvnorm(N, mean = runif(p, -1, 1), sigma = S)
Design = cbind(rep(1, N), X)
beta = sort(sample(c(rep(0,p), runif(1))))
y = rbinom(N, 1, expit(Design%*%beta))
list(X=X,y=as.numeric(y),N=N,p=p)
}
# function to calculate gradient vector at given coefficient values
logistic_gr <- function(beta, y, x, min=TRUE){
mu = beta[1] + x %*% beta[-1]
p = length(beta)
n = length(y)
D = cbind(rep(1,n), x)
gri = matrix(nrow=n, ncol=p)
for(j in 1:p){
gri[,j] = D[,j]*(exp(-mu)*y-1+y)/(1+exp(-mu))
}
gr = apply(gri, 2, sum)
if(min) gr = -gr
gr
}
# function to calculate Hessian matrix at given coefficient values
logistic_hess <- function(beta, y, x, min=TRUE){
# allow to fail with NA, NaN, Inf values
mu = beta[1] + x %*% beta[-1]
p = length(beta)
n = length(y)
D = cbind(rep(1,n), x)
h = matrix(nrow=p, ncol=p)
for(j in 1:p){
for(k in 1:p){
h[j,k] = -sum(D[,j]*D[,k]*(exp(-mu))/(1+exp(-mu))^2)
}
}
if(min) h = -h
h
}
# function to calculate likelihood (up to a constant) at given coefficient values
logistic_ll <- function(beta, y,x, gr=FALSE, he=FALSE, min=TRUE){
mu = beta[1] + x %*% beta[-1]
lli = log(expit(mu))*y + log(1-expit(mu))*(1-y)
ll = sum(lli)
if(is.na(ll) | is.infinite(ll)) ll = -1e16
if(min) ll=-ll
# the below specification is required for using analytic gradient/Hessian in nlm function
if(gr) attr(ll, "gradient") <- logistic_gr(beta, y=y, x=x, min=min)
if(he) attr(ll, "hessian") <- logistic_hess(beta, y=y, x=x, min=min)
ll
}
First example, with p=3:
dat = mk.logit.data(N=100, p=3)
The glm function estimates are for reference. nlm should give the same answer, allowing for small errors due to approximation.
(glm.sol <- glm(dat$y~dat$X, family=binomial()))$coefficients
> (Intercept) dat$X1 dat$X2 dat$X3
> 0.00981465 0.01068939 0.04417671 0.01625381
# works when correct analytic gradient is specified
(nlm.sol1 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, y=dat$y, x=dat$X))$estimate
> [1] 0.009814547 0.010689396 0.044176627 0.016253966
# works, but less accurate when correct analytic hessian is specified (even though the routine notes convergence is probable)
(nlm.sol2 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, he=TRUE, y=dat$y, x=dat$X, hessian = TRUE, check.analyticals=TRUE))$estimate
> [1] 0.009827701 0.010687278 0.044178416 0.016255630
But the problem becomes apparent when p is larger; here it is 10:
dat = mk.logit.data(N=100, p=10)
Again, glm solution for reference. nlm should give the same answer, allowing for small errors due to approximation.
(glm.sol <- glm(dat$y~dat$X, family=binomial()))$coefficients
> (Intercept) dat$X1 dat$X2 dat$X3 dat$X4 dat$X5 dat$X6 dat$X7
> -0.07071882 -0.08670003 0.16436630 0.01130549 0.17302058 0.03821008 0.08836471 -0.16578959
> dat$X8 dat$X9 dat$X10
> -0.07515477 -0.08555075 0.29119963
# works when correct analytic gradient is specified
(nlm.sol1 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, y=dat$y, x=dat$X))$estimate
> [1] -0.07071879 -0.08670005 0.16436632 0.01130550 0.17302057 0.03821009 0.08836472
> [8] -0.16578958 -0.07515478 -0.08555076 0.29119967
# fails to converge in 5000 iterations when correct analytic hessian is specified
(nlm.sol2 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, he=TRUE, y=dat$y, x=dat$X, hessian = TRUE, iterlim=5000, check.analyticals=TRUE))$estimate
> [1] 0.31602065 -0.06185190 0.10775381 -0.16748897 0.05032156 0.34176104 0.02118631
> [8] -0.01833671 -0.20364929 0.63713991 0.18390489
Edit: I should also add that I have confirmed that the Hessian matrix is correct using several different approaches.
I tried the code, but at first it seemed to be using a different rmvnorm than I can find on CRAN. I found one rmvnorm in dae package, then one in the mvtnorm package. The latter is the one to use.
nlm() was patched about the time of the above posting. I'm currently trying to verify the patches, and it now seems to work OK. Note that I am the author of a number of R's optimization codes, including three of the five methods in optim().
nashjc at uottawa.ca
Code is below.
Revised code:
# example demonstrating a problem with NLM
expit <- function(mu) {1/(1+exp(-mu))}
mk.logit.data <- function(N,p){
set.seed(1232)
U = matrix(runif(p*p), nrow=p, ncol=p)
S = 0.5*(U+t(U)) + p*diag(rep(1,p))
X = rmvnorm(N, mean = runif(p, -1, 1), sigma = S)
Design = cbind(rep(1, N), X)
beta = sort(sample(c(rep(0,p), runif(1))))
y = rbinom(N, 1, expit(Design%*%beta))
list(X=X,y=as.numeric(y),N=N,p=p)
}
# function to calculate gradient vector at given coefficient values
logistic_gr <- function(beta, y, x, min=TRUE){
mu = beta[1] + x %*% beta[-1]
p = length(beta)
n = length(y)
D = cbind(rep(1,n), x)
gri = matrix(nrow=n, ncol=p)
for(j in 1:p){
gri[,j] = D[,j]*(exp(-mu)*y-1+y)/(1+exp(-mu))
}
gr = apply(gri, 2, sum)
if(min) gr = -gr
gr
}
# function to calculate Hessian matrix at given coefficient values
logistic_hess <- function(beta, y, x, min=TRUE){
# allow to fail with NA, NaN, Inf values
mu = beta[1] + x %*% beta[-1]
p = length(beta)
n = length(y)
D = cbind(rep(1,n), x)
h = matrix(nrow=p, ncol=p)
for(j in 1:p){
for(k in 1:p){
h[j,k] = -sum(D[,j]*D[,k]*(exp(-mu))/(1+exp(-mu))^2)
}
}
if(min) h = -h
h
}
# function to calculate likelihood (up to a constant) at given coefficient values
logistic_ll <- function(beta, y,x, gr=FALSE, he=FALSE, min=TRUE){
mu = beta[1] + x %*% beta[-1]
lli = log(expit(mu))*y + log(1-expit(mu))*(1-y)
ll = sum(lli)
if(is.na(ll) | is.infinite(ll)) ll = -1e16
if(min) ll=-ll
# the below specification is required for using analytic gradient/Hessian in nlm function
if(gr) attr(ll, "gradient") <- logistic_gr(beta, y=y, x=x, min=min)
if(he) attr(ll, "hessian") <- logistic_hess(beta, y=y, x=x, min=min)
ll
}
##!!!! NOTE: Must have this library loaded
library(mvtnorm)
dat = mk.logit.data(N=100, p=3)
(glm.sol <- glm(dat$y~dat$X, family=binomial()))$coefficients
# works when correct analytic gradient is specified
(nlm.sol1 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, y=dat$y, x=dat$X))$estimate
# works, but less accurate when correct analytic hessian is specified (even though the routine notes convergence is probable)
(nlm.sol2 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, he=TRUE, y=dat$y, x=dat$X, hessian = TRUE, check.analyticals=TRUE))$estimate
dat = mk.logit.data(N=100, p=10)
# Again, glm solution for reference. nlm should give the same answer, allowing for small errors due to approximation.
(glm.sol <- glm(dat$y~dat$X, family=binomial()))$coefficients
# works when correct analytic gradient is specified
(nlm.sol1 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, y=dat$y, x=dat$X))$estimate
# fails to converge in 5000 iterations when correct analytic hessian is specified
(nlm.sol2 <- nlm(p=runif(dat$p+1), f=logistic_ll, gr=TRUE, he=TRUE, y=dat$y, x=dat$X, hessian = TRUE, iterlim=5000, check.analyticals=TRUE))$estimate
I've written a function to calculate the maximum likelihood estimates of a Generalised Pareto Distribution. When I use it with any data, though, I get warnings like this:
1: In log(beta * ksi) : NaNs produced
2: In nlm(loglik, theta, stepmax = 5000, iterlim = 1000) :
NA/Inf replaced by maximum positive value
I was wondering if anyone could spot any mistakes with my code?
MLGPD<-function(data){
xi0 <- 1
beta0 <- 360
theta <- c(xi0, beta0)
excess <- data
assign("tmp", excess)
loglik <- function(theta){
ksi <- theta[1]
beta <- theta[2]
y <- ((tmp - 0.1)/beta)
f <- ((1/ksi)+1)*sum(log(1+y)) + length(tmp) * log(beta*ksi)
f
}
fit <- nlm(loglik, theta, stepmax = 5000, iterlim= 1000)
return(fit)
par.ests <- fit$x
return(par.ests)
}
#Checking our MLE algorithm works:
rgpd<-function(n,ksi, beta){
10000+beta*(((1-runif(n, min=0, max=1))^-ksi)-1)
}
rgpd1 <- rgpd(100, 1, 2.5)
MLGPD(rgpd1)
Thanks!
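For comparison with the function above, here is a minimal sketch of the usual GPD negative log-likelihood for excesses over a known threshold, assuming the standard parameterisation with shape xi > 0 and scale beta > 0. The names gpd_nll, y_sim, xi_true and beta_true are made up for this example, and optimising on the log scale is just one way to keep nlm() away from invalid parameter values.

```r
# Average negative log-likelihood of the GPD for excesses y = x - u > 0:
#   log(beta) + (1/xi + 1) * mean(log(1 + xi * y / beta))
# (the usual negative log-likelihood divided by n, so the minimiser is the same)
gpd_nll <- function(log_theta, y) {
  xi   <- exp(log_theta[1])   # shape, kept positive by the log scale
  beta <- exp(log_theta[2])   # scale, kept positive by the log scale
  log(beta) + (1 / xi + 1) * mean(log1p(xi * y / beta))
}

# simulate excesses by inverting the GPD distribution function
set.seed(1)
xi_true   <- 0.5
beta_true <- 2
y_sim <- beta_true * ((1 - runif(5000))^(-xi_true) - 1) / xi_true

gpd_fit <- nlm(gpd_nll, p = log(c(1, 1)), y = y_sim)
gpd_est <- exp(gpd_fit$estimate)   # back-transform to (xi, beta)
```

Note that the shape parameter multiplies the scaled excess inside the logarithm and that the leading term is log(beta) rather than log(beta * xi).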