For some reason, my function that computes BIC values to pick the best model in a multivariable regression works with one dataset but not another, and I can't see why it would break. When it doesn't work (for the airline data), it gives the following error:
Error in leaps(X, y, method = "r2", nbest = 1, names = colnames(X)) :
leaps requires full-rank design matrix; use regsubsets()
3. stop("leaps requires full-rank design matrix; use regsubsets()")
2. leaps(X, y, method = "r2", nbest = 1, names = colnames(X)) at leaps_AIC.R#22
leaps_AIC(X = baseball[, 2:14], y = baseball[, 1])
Function:
leaps_AIC = function(X, y) {
  #
  # X is an n x k matrix and y a vector of n values.
  #
  # This function fits linear models of the sort lm(y~X[,cols]), where
  # cols is a subset of {1,...,k}. It uses the package leaps to find
  # the best model (according to R^2) among all those for which the
  # number of elements of cols is j, and it does so for j=1,...,k. It
  # then determines AIC and BIC for each of the k models so determined.
  #
  # In case X and y aren't of the right mode, make them a matrix and
  # a vector, respectively.
  #
  require(leaps)
  y = as.vector(y)
  X = as.matrix(X)
  num = ncol(X)
  #
  # Create the leaps output, returning only the best model for each
  # number of variables (i.e., nbest=1).
  #
  out = leaps(X, y, method = 'r2', nbest = 1, names = colnames(X))
  #
  # Initialize the aic and bic vectors.
  #
  aic = 1:num
  bic = 1:num
  #
  # Compute all the AIC and BIC values.
  #
  for (j in 1:num) {
    #
    # Determine the variables in the best model containing j independent
    # variables.
    #
    cols = (1:num)[out$which[j, ] == TRUE]
    fit = lm(y ~ X[, cols])
    #
    # Determine AIC and BIC for the best model with j variables.
    #
    aic[j] = AIC(fit)
    bic[j] = AIC(fit, k = log(length(y)))
  }
  #
  # Print out the output.
  #
  print(cbind(out$which, 'R2' = out$r2, 'AIC' = aic, 'BIC' = bic))
  #
  # Return the output.
  #
  return(list('leaps.out' = out, 'AIC' = aic, 'BIC' = bic))
}
The function works just fine with the Baseball-Salary-Data dataset, using the code below:
Data = read.csv("Baseball-Salary-Data.csv", header = T)
baseball = Data[, 1:17]
source('leaps_AIC.R')
aic.bic.results = leaps_AIC(X = baseball[, 2:17], y = baseball[, 1])
n.x.best = which.min(aic.bic.results$BIC) # number of independent variables in the best model chosen by BIC
x.names = colnames(baseball)[2:17]
x.masks = aic.bic.results$leaps.out$which[n.x.best, ]
x.best = x.names[x.masks] # independent variables in the best model
x.best
Here's Baseball-Salary-Data.csv, or a few rows of it anyway:
salary,batting average,on base percent,runs,hits,doubles,triples,home runs,rbi,walks,strike outs,stolen bases,errors,free agent eligible,free agent,arbitration eligible,arbitration,player
3300,0.272,0.302,69,153,21,4,31,104,22,80,4,3,1,0,0,0,Andre Dawson
2600,0.269,0.335,58,111,17,2,18,66,39,69,0,3,1,1,0,0,Steve Buchele
2500,0.249,0.337,54,115,15,1,17,73,63,116,6,5,1,0,0,0,Kal Daniels
2475,0.26,0.292,59,128,22,7,12,50,23,64,21,21,0,0,1,0,Shawon Dunston
2313,0.273,0.346,87,169,28,5,8,58,70,53,3,8,0,0,1,0,Mark Grace
2175,0.291,0.379,104,170,32,2,26,100,87,89,22,4,1,0,0,0,Ryne Sandberg
Now, with the airline data, it just doesn't work.
The airline data is from a Wendover Productions video, but I deleted WOW Air because some of its data points weren't available. I was really curious to see what the outcome would be if I ran a multivariable regression on it. Here are a few rows:
EBIT,Average LH Stage Length (mi),Cabin (ft^2/st),PrEcon,LhSh,LH Routes,Hubs,Origin Traffic Coeff Var,Origin GDP/C,Avg Efficiency,Utilization,Total Weekly Route Flight Freq,Competitors/Route,Avg LH Aircraft Age
-1077500000.0,4927,9.145,0.1652,0.0846150,16.0,3.00,0.176930,53074.5000,77.090909,0.599597,6.0,0.50,15.083333
47000000.0000,4802,8.047,0.0759,0.4210530,50.0,2.00,0.155502,53074.5000,76.000000,0.625681,4.0,0.340,24.000000
-108000000.00,4712,8.056,0.1392,0.2890630,51.0,9.00,0.158911,48612.5745,98.351351,0.674158,21.0,1.630,2.3250000
108000000.000,4028,6.539,0.0000,0.1176470,3.00,1.00,0.066578,8951.10000,79.000000,0.502052,21.0,1.660,4.7500000
111000000.000,4169,9.482,0.1042,0.0857140,10.0,4.00,0.091657,16096.4000,83.000000,0.648486,7.00,0.450,12.700000
187000000.000,3976,7.770,0.0669,0.0932220,15.0,6.00,0.069247,51663.4000,96.000000,0.675699,13.0,1.150,5.0000000
406000000.000,3865,7.182,0.1800,0.2066120,66.0,18.0,0.187288,52514.4439,92.720000,0.622061,3.00,0.230,7.9600000
43000000.0000,3614,7.562,0.0850,0.6896550,15.0,2.00,0.054917,96032.9133,97.500000,0.614427,24.0,1.250,2.9000000
-21000000.000,3201,8.135,0.0329,0.0975610,6.00,1.00,0.063495,31782.2000,79.000000,0.692841,19.0,1.400,8.4750000
271000000.000,3114,3.114,0.1301,0.0484850,34.0,5.00,0.118285,48130.3000,87.500000,0.403065,10.0,0.860,14.000000
And here's the code for the airline data:
Data = read.csv("Airline-Data.csv", header = T)
airline = Data[, 1:14]
source('leaps_AIC.R')
aic.bic.results = leaps_AIC(X = airline[, 2:14], y = airline[, 1])
n.x.best = which.min(aic.bic.results$BIC) # number of independent variables in the best model chosen by BIC
x.names = colnames(airline)[2:14]
x.masks = aic.bic.results$leaps.out$which[n.x.best, ]
x.best = x.names[x.masks] # independent variables in the best model
x.best
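For reference (this check is not part of the original post), the error complains about a rank-deficient design matrix; a minimal way to inspect this, assuming the airline data has been read in as above, is:
# Hedged sketch: compare the rank of the design matrix with its column count
X <- as.matrix(airline[, 2:14])
nrow(X)                 # number of observations
ncol(X) + 1             # columns in the design matrix (predictors + intercept)
qr(cbind(1, X))$rank    # if this is smaller than ncol(X) + 1, leaps() will stop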
I have run a multiple imputation (m = 45, 10 iterations) using the mice package and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
  x1 = c(1,2,2,3,2,2,3,3,2,3,
         1,2,2,3,2,2,3,3,2,3,
         1,2,2,3,2,2,3,3,2,3),
  x2 = c(1,1,1,2,3,3,2,3,3,3,
         1,1,1,2,3,3,2,3,3,3,
         1,2,2,3,2,2,3,3,2,3),
  x3 = c(1,1,2,1,2,3,3,3,2,3,
         1,1,2,1,2,3,3,3,2,3,
         1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
  x <- replicate(ncol(data), rbinom(nrow(data), 1, prob))
  for (k in 1:ncol(data)) {
    data[, k] <- ifelse(x[, k] == 1, NA, data[, k])
  }
  data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute Cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no readily available formula with which we can analytically estimate the standard error of alpha, we also need to deploy a bootstrapping procedure to estimate this standard error.
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame containing the completed data from a mids object (i.e. one imputed data set).
boot: a logical indicating whether a non-parametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
  n <- nrow(list_compl_data); p <- ncol(list_compl_data)
  total_variance <- var(rowSums(list_compl_data))
  item_variance <- sum(apply(list_compl_data, 2, sd)^2)
  alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
  out <- list(alpha = alpha)
  boot_alpha <- numeric(B)
  if (boot) {
    for (i in seq_len(B)) {
      boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
      total_variance <- var(rowSums(boot_dat))
      item_variance <- sum(apply(boot_dat, 2, sd)^2)
      boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
    }
    out$var <- var(boot_alpha)
  }
  if (ci) {
    out$ci <- quantile(boot_alpha, c(.025, .975))
  }
  return(out)
}
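As a quick sanity check (this call is not in the original answer), the function can first be tried on a single completed data set; with ci = TRUE it also returns a percentile interval based on the bootstrap draws:
# Hypothetical one-off check on the first completed data set
cronbach_fun(complete(imp, 1), boot = TRUE, B = 1e3, ci = TRUE)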
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
  set.seed(i) # fix random number generator
  sub <- implong[implong$.imp == i, -c(1, 2)]
  boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
  out <- c(
    alpha = x$qbar,
    lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
    upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
  )
  return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
I am trying to manually pool results from quantile regression models run on multiply imputed data in R using mice. I make use of a bootstrapping procedure to get 95% CIs and P values of the model terms, in which model parameters and their standard errors are obtained after sampling a number of rows equal to the number of unique participants in my data set. This procedure is repeated 500 times for each of the m imputed data sets. As a last step, I pool the estimated coefficients and their standard errors of the resulting 500 * m regression models according to Rubin's rules (1987) (see e.g. https://bookdown.org/mwheymans/bookmi/rubins-rules.html). To speed things up, I use foreach to split the analyses over multiple processor cores and for to loop over the m imputed data sets.
However, there seems to be a flaw in the part wherein the results are pooled. When I look at the pooled results, I observe that the P values are not in accordance with the 95% CIs (e.g. P < 0.05 when 0 is included in the 95% CI).
To illustrate this issue, I have made a reproducible example, using these publicly available data: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv
Because there are no missing data in this data set, I introduce them myself and impute the data (m = 10 multiply imputed data sets with 20 iterations). I use set.seed for reproducibility.
Note that I use lm instead of quantreg::rq in this example.
# load data
projdir <- "my_directory"
d <- read.csv(file = file.path(projdir, 'heart_failure_clinical_records_dataset.csv'))
#### introduce missing values
set.seed(1)
# age
age_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$age[age_miss_tag == 1] <- NA # MCAR
# serum creatinine
creat_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$serum_creatinine[creat_miss_tag == 1 & d$anaemia == 0] <- NA # MAR
# CK
CK_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$creatinine_phosphokinase[CK_miss_tag & d$platelets > median(d$platelets)] <- NA # MAR
# platelets
platelets_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$platelets[platelets_miss_tag == 1] <- NA # MCAR
library(mice); library(mitml); library(miceadds); library(splines); library(foreach); library(doParallel)
# impute data
imp <- mice(d, maxit = 20, m = 10, seed = 2)
# log creatinine
implong <- complete(imp, 'long', include = FALSE)
implong$log_creat <- log(implong$serum_creatinine)
imp <- miceadds::datlist2mids(split(implong, implong$.imp))
# compute values for Boundary.knots
temp <- complete(imp, 'long', include = FALSE)
B_knots <- rowMeans(sapply(split(temp, temp$.imp), function(x) {
  quantile(x$age, c(0.1, 0.9))
}))
# Convert mids object into a datlist
longlist <- miceadds::mids2datlist(imp)
# fit model based on original data and use the terms in the below foreach loop
# in order to fix the position of the inner knots
fit_orig <- lm(log_creat ~
                 # Main effects
                 ns(age, df = 2, B = c(B_knots[1], B_knots[2])) * sex,
               data = longlist[[1]])
To further speed things up, I use OLS instead of quantile regression here and parallelize the process.
# make cluster used in foreach
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, iseed = 9956)
registerDoParallel(cl)
# No. of bootstrap samples to be taken
n_iter <- 500
boot.1 <- c()
for (k in seq_along(longlist)) {
  boot.1[[k]] <- foreach(i = seq_len(n_iter),
                         .combine = rbind,
                         .packages = c('mice', 'mitml', 'splines')) %dopar% {
    # store data from which rows can be sampled
    longlist0 <- longlist[[k]]
    # set seed for reproducibility
    set.seed(i)
    # sample rows
    boot_dat <- longlist0[sample(1:nrow(longlist0), replace = TRUE), ]
    # linear regression model based on sampled rows
    fit1 <- lm(terms(fit_orig), data = boot_dat)
    # save coefficients
    fit1$coef
  }
}
stopCluster(cl)
As a last step, I pool the results according to Rubin's rules.
n_cols <- dim(boot.1[[1]])[2]
list <- c()
for (i in seq_len(n_cols)) {
  # extract coefficients
  parameter <- lapply(boot.1, function(x){
    x[, i]
  })
  m <- length(parameter)
  for (k in seq_len(m)) {
    names(parameter[[k]]) <- NULL
  }
  Q <- sapply(parameter, mean)
  U <- sapply(parameter, var) # (standard error of estimate)^2
  #### Pooling
  # Pooled univariate estimate
  qbar <- mean(Q)
  # Mean of the variances (i.e. the pooled within-imputation variance)
  ubar <- mean(U)
  # Between-imputation variance
  btw_var <- var(Q)
  # Total variance of the pooled estimate
  tot_var <- ubar + btw_var + (btw_var / m)
  # Relative increase in variance due to non-response
  r_var <- (btw_var + (btw_var / m)) / ubar
  # Fraction of missing information
  lambda <- (btw_var + (btw_var / m)) / tot_var
  # degrees of freedom for the t-distribution according to Rubin (1987)
  df_old <- (m - 1) / lambda^2
  # sample size in the imputed data sets
  n_sample <- nrow(longlist[[1]])
  # observed degrees of freedom
  df_observed <- (((n_sample - n_cols) + 1) / ((n_sample - n_cols) + 3)) *
    (n_sample - n_cols) * (1 - lambda)
  # adjusted degrees of freedom according to Barnard & Rubin (1999)
  df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
  # 95% confidence interval of qbar
  lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
  upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
  # F statistic
  q <- ((0 - qbar)^2 / tot_var)^2
  # Significance level associated with the null value Q[0]
  p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
  list[[i]] <- cbind(qbar, lwr, upr, p_value)
}
names(list) <- colnames(boot.1[[1]])
list
Obviously, the P value shown below is not in accordance with the 95% CI (as 0 is included in the CI, so the P value should be ≥0.05).
> list
$`(Intercept)`
qbar lwr upr p_value
[1,] 0.06984595 -0.02210231 0.1617942 0.008828337
EDIT (29 Dec 2021)
As @Gerko Vink notes in his answer, multiple imputation and bootstrapping both induce variance. The variance induced by imputation is taken care of by Rubin's rules; the bootstrap variance is not. Unfortunately, mice::pool will not work with the output returned by quantreg::rq.
I am aware that bootstrap CIs can be constructed using a naive percentile-based approach, as shown in this post, but I am inclined to think this is not the correct approach to proceed with.
Does anyone know how to appropriately take care of the extra variance induced by bootstrapping when using rq?
EDIT (30 Dec 2021)
Inspired by this recent post, I decided not to go down the bootstrapping road any further, but instead to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules. I have posted this approach as an answer below. Any input on how to appropriately take care of the extra variance induced by bootstrapping when using rq is still very welcome though!
Bootstrapping and multiple imputation both induce variance. The imputation variance is taken care of by Rubin's rules for parameters with normal sampling distributions. The bootstrap variance is not.
Two remarks:
First, there is a small error in your code. You're already calculating the bootstrap variance about Q in U <- sapply(parameter, var), so there is no need to divide it again by the number of bootstrap samples (U <- U/n_iter): U is already the variance, and sapply(parameter, sd) would yield the bootstrapped standard error.
Second, you're using bootstrap parameters to calculate a parametric interval and p-value. That seems needlessly complicated and, as you can see, potentially problematic. Why not calculate the bootstrap CI?
See also this link for some inspiration with respect to different means of calculating the CIs and their respective validity.
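For illustration only (this snippet is not part of the original answer), a naive percentile-based bootstrap CI could be read straight off the coefficient draws stored in boot.1, for example by stacking the 500 * m draws and taking empirical quantiles:
# Hedged sketch, assuming boot.1 is the list of coefficient matrices built above
draws <- do.call(rbind, boot.1)
apply(draws, 2, quantile, probs = c(0.025, 0.975))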
Here is a small sim that demonstrates that you cannot expect both to be identical for a finite set of bootstrap replications.
library(purrr)
library(magrittr)
#fix seed
set.seed(123)
#some data
n = 1000
d <- rnorm(n, 0, 1)
# ci function
fun <- function(x){
  se <- var(x)/length(x)
  lwr <- mean(x) - 1.96 * se
  upr <- mean(x) + 1.96 * se
  ci <- c(lwr, upr)
  return(ci)
}
# bootstrap
boot <- replicate(500,
                  d[sample(1:1000, 1000, replace = TRUE)],
                  simplify = FALSE)
# bootstrapped ci's based on parameters
boot.param.ci <- boot %>%
  map(~.x %>% fun) %>%
  do.call("rbind", args = .)
# bootstrap CI
boot.ci <- boot %>%
  map(~.x %>% mean) %>%
  unlist %>%
  quantile(c(.025, .975))
# Overview
data.frame(param = fun(d),
           boot.param = boot.param.ci %>% colMeans,
           boot.ci = boot.ci)
#> param boot.param boot.ci
#> 2.5% 0.01420029 0.01517527 -0.05035913
#> 97.5% 0.01805545 0.01904181 0.07245449
Created on 2021-12-22 by the reprex package (v2.0.1)
The following reprex also demonstrates that the bootstrap applied to the imputed data yields different variance estimates under the same pooling rules.
library(purrr)
library(magrittr)
library(mice)
#fix seed
set.seed(123)
imp <- mice(boys,
            m = 10,
            printFlag = FALSE)
imp %>%
  complete("all") %>%
  map(~.x %$%
        lm(age ~ hgt + hc)) %>%
  pool %>%
  summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9601179 0.809167659 -2.422388 682.5182 0.01567825 -3.5488747
#> 2 hgt 0.1690468 0.002784939 60.700342 572.1861 0.00000000 0.1635768
#> 3 hc -0.2138941 0.021843724 -9.792018 639.0432 0.00000000 -0.2567883
#> 97.5 %
#> 1 -0.3713610
#> 2 0.1745167
#> 3 -0.1710000
imp %>%
  complete("all") %>%
  map(~.x %>%
        .[sample(1:748, 748, replace = TRUE), ] %$%
        lm(age ~ hgt + hc)) %>%
  pool %>%
  summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9810146 1.253312293 -1.580623 22.57546 1.278746e-01 -4.5763892
#> 2 hgt 0.1689181 0.004124538 40.954423 24.47123 0.000000e+00 0.1604141
#> 3 hc -0.2133606 0.033793045 -6.313743 22.29686 2.217445e-06 -0.2833890
#> 97.5 %
#> 1 0.6143599
#> 2 0.1774221
#> 3 -0.1433322
Created on 2021-12-22 by the reprex package (v2.0.1)
For quantile regression, mice::pool will not work with the output returned by quantreg::rq, because (according to this post) there is no agreed upon method to calculate standard errors, which are required to pool results under multiple imputation.
An ad hoc solution would be to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules.
First, a reprex using lm to verify whether results from the manual approach and mice::pool are identical.
library(mice)
imp <- mice(nhanes, print = FALSE, seed = 123)
# fit linear model
fit <- with(imp, lm(bmi ~ chl + hyp))
# manually pool univariate estimates using Rubin's rules
pool_manual <- \(model_object) {
  m <- length(model_object$analyses)
  Q <- sapply(model_object$analyses, \(x) summary(x)$coefficients[, 'Estimate'])
  U <- sapply(model_object$analyses, \(x) (summary(x)$coefficients[, 'Std. Error'])^2)
  qbar <- rowMeans(Q)
  ubar <- rowMeans(U)
  btw_var <- apply(Q, 1, var)
  tot_var <- ubar + btw_var + (btw_var / m)
  lambda <- (btw_var + (btw_var / m)) / tot_var
  df_old <- (m - 1) / lambda^2
  n_sample <- length(residuals(model_object$analyses[[1]]))
  n_cols <- dim(Q)[1]
  df_com <- n_sample - n_cols
  df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
  df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
  lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
  upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
  q <- (0 - qbar)^2 / tot_var
  p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
  df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
  rownames(df) <- NULL
  names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
  return(df)
}
Verify.
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
> extract <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
> summary(pool(fit), conf.int = TRUE)[, extract]
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
Quantile regression
Now, let's pool results from rq for the expected median of the outcome.
library(quantreg)
# fit quantile regression model
fit <- with(imp, rq(bmi ~ chl + hyp, tau = 0.5))
To be able to pool results from rq, only the summary method used to extract point estimates and variances from each of the imputed data sets needs to be adjusted in pool_manual.
Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
Result
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 22.23452856 0.8551626 43.6138945 0.04461337
2 chl 0.03487894 -0.0857199 0.1554778 0.47022312
3 hyp -1.43636147 -6.0666990 3.1939761 0.52455041
> summary(pool(fit), conf.int = TRUE)[, extract]
Error in rq.fit.br(x, y, tau = tau, ci = TRUE, ...) :
unused arguments (effects = "fixed", parametric = TRUE, exponentiate = FALSE)
I am trying to figure out how to sample from a custom density in rJAGS but am running into issues. Having searched the site, I saw that there is a zeros (or ones) trick that can be employed based on BUGS code, but I am having a hard time with its implementation in rJAGS. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <- "
model {
  for (i in 1:N){
    x[i] ~ dpois(lambda)
    lambda <- -N*log(1-exp(-(1/mu)))
  }
  mu ~ dlnorm(mup, taup)
  mup <- log(.0001)
  taup <- 1/49
  R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec <- textConnection(model1.string)
jags <- jags.model(model1.spec,
                   data = list('x' = x,
                               'N' = N),
                   n.chains = 4,
                   n.adapt = 100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
Using the zeros trick, the variable on the left-hand side of the dpois() relationship has to be an N-length vector of zeros. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <- "
model {
  for (i in 1:N){
    lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
    loglam[i] <- log(lambda[i]) + C
    z[i] ~ dpois(loglam[i])
  }
  mu ~ dnorm(0,.1)
  tau ~ dgamma(1,.1)
  sig2 <- pow(tau, -1)
  sumLL <- sum(log(lambda[]))
}
"
model1.spec <- textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
                   data = list('x' = x,
                               'z' = z,
                               'N' = N,
                               'C' = C,
                               'pi' = pi),
                   inits = function() list(tau = 1, mu = 3),
                   n.chains = 4,
                   n.adapt = 100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 4.493 2.1566 0.034100 0.1821
sig2 1.490 0.5635 0.008909 0.1144
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779
I have the following code:
z7 <- function(data, k, e){
  require(zoo)
  df = data
  r = df$ROA
  t = df$t
  EA = df$EA
  k = k
  e = e
  # Estimate rolling linear models
  models = rollapply(df, width = k, FUN = function(z)
    coef(lm(r~t, data = as.data.frame(z))), by.column = FALSE, align = "right")
  # Extract residuals from the models
  res = rollapply(df, width = k, FUN = function(x)
    residuals(lm(r~t, data = as.data.frame(x))), by.column = FALSE, align = "right")
  # Standard deviation and mean of residuals, on a row basis
  s = as.data.frame(apply(res, 1, sd))
  m = as.data.frame(apply(res, 1, mean)) # note that this is approximately 0 due to detrending
  # Combine the data; define n as number of rows in the dataset
  dataset = cbind(models, res, m, s)
  n = as.vector(nrow(dataset))
  n
  dataset
  # Compute predictions at k+1
  for(i in n){
    x = k + 1
    preds = dataset$`(Intercept)` + dataset$t*(x)
    x = x + 1
  }
  # Compute coefficient of variation
  for(j in n){
    n2 = k + 1
    tau = ((1 + 1 / (4*(n2))) * (dataset$apply.res..1..sd./dataset$apply.res..1..mean.))
  }
  dataset3 = cbind(dataset, tau)
  dataset3
  # Compute mean of chi distribution and the adjusted standard deviation
  Mchi <- sqrt(2)*((gamma((k+1)/2))/gamma(k/2))
  S = s*Mchi*(k+1)/sqrt(k)
  # Compute z7, checking whether the adjusted sd or cv should be used
  for(i in nrow(dataset3)){
    if (abs(dataset3$tau*dataset3$preds) < e) {
      z = -(dataset3$EA + dataset3$preds) / S
    } else
      z = -(dataset3$EA + dataset3$preds) /(dataset3$tau*dataset3$preds)
  }
}
As you can see, I am creating a function that computes an adjusted standardised score. Typically, the z-score is defined as (x - mean)/sd.
In this case, we are taking into account the fact that x is a random variable which is nonstationary. Therefore, the measure must be estimated on a rolling basis and constructed iteratively over the number of observations.
df is the dataset of interest, k is the window length used for estimating the rolling linear models, and e is simply a threshold used to decide whether to use the coefficient of variation or an alternative standard deviation that is adjusted for heteroscedasticity.
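For intuition only (this snippet is not the author's z7), a plain rolling z-score over a window of width k can be computed with zoo::rollapply, standardising the last observation in each window against that window's mean and standard deviation:
# Minimal illustrative sketch, not part of the original question
library(zoo)
x <- c(1, 3, 2, 5, 4, 6, 8, 7)
k <- 4
rollapply(x, width = k, align = "right",
          FUN = function(w) (tail(w, 1) - mean(w)) / sd(w))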
I am getting an error when I run my function with the following test measures:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
EA = rnorm(0:15)
df = data.frame(t,r,EA)
test = z7(df, 3, 0.00000000001)
The error is:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 14, 0
The traceback is:
5. stop(gettextf("arguments imply differing number of rows: %s",
       paste(unique(nrows), collapse = ", ")), domain = NA)
4. data.frame(..., check.names = FALSE)
3. cbind(deparse.level, ...)
2. cbind(dataset, tau)
1. z7(df, 3, 1e-11)
How can I fix this error? Also, is there a way to simplify my code?
Thank you.
I think the error occurs at this line:
tau = ((1 + 1 / (4*(n2))) * (dataset$apply.res..1..sd./dataset$apply.res..1..mean.))
I changed it to
tau = ((1 + 1 / (4*(n2))) * (dataset$`apply(res, 1, sd)`/dataset$`apply(res, 1, mean)`))
And in the last for loop, I guess there is a problem with dataset3$preds:
> dataset3$preds
NULL
And at the beginning you declared r = df$ROA, but I think this sets r equal to NULL (the test data frame has columns t, r and EA, not ROA).
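A hedged sketch of one way to address those two points (these lines are not in the original answer and reuse the question's own column accessors):
# use the column that actually exists in the test data frame
r <- df$r
# store the k + 1 predictions as a column so that dataset3$preds exists later on
dataset$preds <- dataset$`(Intercept)` + dataset$t * (k + 1)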
Hope that was useful!
Greetings
WW