Related
I have run a multiple imputation (m=45, 10 iterations) using the MICE package, and want to calculate the cronbach's alpha for a number of ordinal scales in the data. Is there a function in r that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no available formula with which we can analytically estimate the standard error of alpha, we also need to deploy a bootstrapping procedure to estimate this standard error.
The function cronbach_fun() takes the following arguments:
list_compl_data: a character string specifying the list of completed data from a mids object.
boot: a logical indicating whether a non-parametrical bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
n <- nrow(list_compl_data); p <- ncol(list_compl_data)
total_variance <- var(rowSums(list_compl_data))
item_variance <- sum(apply(list_compl_data, 2, sd)^2)
alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
out <- list(alpha = alpha)
boot_alpha <- numeric(B)
if (boot) {
for (i in seq_len(B)) {
boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
total_variance <- var(rowSums(boot_dat))
item_variance <- sum(apply(boot_dat, 2, sd)^2)
boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
}
out$var <- var(boot_alpha)
}
if (ci){
out$ci <- quantile(boot_alpha, c(.025,.975))
}
return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
I'm working with the train() function from the caret package to fit multiple regression and ML models to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. Biggest issue is that caret doesn't provide all the model fit statistics that I'd like so they need to be derived from the raw output. Based on my exploration there doesn't seem to be a standardized way caret outputs each models fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
The real question is is there a standardized output of caret::train() for x variables or either y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducability
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c(y1, y2), c(x1,x2,x3), linear_models)
I am using quantreg package to predict new data based on training set. However, I noticed a discrepancy between predict.rq or predict and doing it manually. Here is an example:
The quantile regression setting is
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expect both are numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
Their difference cannot be negligible.
However, predicting the original data set X, both return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Do I miss something when using the function? Thanks!
A more serious problem here, before we get to prediction is that the model is forcing all of the fitted quantile functions through the origin of design space and since the covariates are centered at the origin all of the quantile functions are forced to cross there. Even if the X's all lie in the positive orthant it is quite a strong assumption to say that the distribution of the response is degenerate at the origin.
I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0
I need to implement a logistic regression manually, using the Score/GMM approach, without the use of GLM. This is because at later stages the model will be much more complicated. Currently I am running into a problem where for the logistic regression, the optimization procedures are very initial point dependent.To illustrate, here is my code using an online dataset. More details about the procedure are in the comments:
library(data,table)
library(nleqslv)
library(Matrix)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
data_analysis<-data.table(mydata)
data_analysis[,constant:=1]
#Likelihood function for logit
#The logistic regression will regress the binary variable
#admit on a constant and the variable gpa
LL <- function(beta){
beta=as.numeric(beta)
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = h/(1+h)
llf <- sum(data_temp$admit * log(choice_prob)) + (sum((one-data_temp$admit) * log(one-choice_prob)))
return(-1*llf)
}
#Score to be used when optimizing using LL
#Identical to the Score function below but returns negative output
Score_LL <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = h/(1+h)
resid = as.numeric(data_temp$admit - choice_prob)
score_final2 = t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
return(-1*as.numeric(score_final2))
}
#The Score/Deriv/Jacobian of the Likelihood function
Score <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = as.numeric(h/(1+h))
resid = as.numeric(data_temp$admit - choice_prob)
score_final2 = t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
return(as.numeric(score_final2))
}
#Derivative of the Score function
Score_Deriv <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
weight = (h/(1+h)) * (1- (h/(1+h)))
weight_mat = Diagonal(length(weight), x=weight)
deriv = t(mat_temp2)%*%weight_mat%*%mat_temp2
return(-1*as.array(deriv))
}
#Quadratic Gain function
#Minimized at Score=0 and so minimizing is equivalent to solving the
#FOC of the Likelihood. This is the GMM approach.
Quad_Gain<- function(beta){
h=Score(as.numeric(beta))
return(sum(h*h))
}
#Derivative of the Quadratic Gain function
Quad_Gain_deriv <- function(beta){
return(2*t(Score_Deriv(beta))%*%Score(beta))
}
sol1=glm(admit ~ gpa, data = data_analysis, family = "binomial")
sol2=optim(c(2,2),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
sol3=optim(c(0,0),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
When I run this code, I get that sol3 matches what glm produces (sol1) but sol2, with a different initial point, differs from the glm solution by a lot. This is something happening in my main code with the actual data as well. One solution is to create a grid and test multiple starting points. However, my main data set has 10 parameters and this would make the grid very large and the program computationally infeasible. Is there a way around this problem?
Your code seems overly complicated. The following two functions define the negative log-likelihood and negative score vector for a logistic regression with the logit link:
logLik_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}
score_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- crossprod(X, y - plogis(eta))
}
Then you can use it as follows:
# load the data
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# fit with optim()
opt1 <- optim(c(-1, 1, -1), logLik_Bin, score_Bin, method = "BFGS",
y = mydata$admit, X = cbind(1, mydata$gre, mydata$gpa))
opt1$par
# compare with glm()
glm(admit ~ gre + gpa, data = mydata, family = binomial())
Typically, for well-behaved covariates (i.e., expecting to have a coefficients in the interval [-4 to 4]), starting at 0 is a good idea.
I am using 'KFAS' package from R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances works well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficients. Does anyone know how can I do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation as ARIMA models. Although I am able to estimate the AR(1) coef. and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter codes from EViews to R, so I need to learn SSMcustom for other models that are more complicated.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state space models, you need to provide your own model updating function. The default behaviour can only handle NA's in covariance matrices H and Q. The main goal of fitSSM is just to get started with simple stuff. For complex models and/or large data, I would recommend using your self-written objective function (with help of logLik method) and your favourite numerical optimization routines manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
if (estimate) {
-logLik(model)
} else {
model
}
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)