How do I calculate cronbach's alpha on multiply imputed data? - r

I have run a multiple imputation (m=45, 10 iterations) using the MICE package, and want to calculate the cronbach's alpha for a number of ordinal scales in the data. Is there a function in r that could assist me in calculating the alpha coefficient across the imputed datasets in a manner that would satisfy Rubin's rules for pooling estimates?

We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since there is no available formula with which we can analytically estimate the standard error of alpha, we also need to deploy a bootstrapping procedure to estimate this standard error.
The function cronbach_fun() takes the following arguments:
list_compl_data: a character string specifying the list of completed data from a mids object.
boot: a logical indicating whether a non-parametrical bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated.
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
n <- nrow(list_compl_data); p <- ncol(list_compl_data)
total_variance <- var(rowSums(list_compl_data))
item_variance <- sum(apply(list_compl_data, 2, sd)^2)
alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
out <- list(alpha = alpha)
boot_alpha <- numeric(B)
if (boot) {
for (i in seq_len(B)) {
boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
total_variance <- var(rowSums(boot_dat))
item_variance <- sum(apply(boot_dat, 2, sd)^2)
boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
}
out$var <- var(boot_alpha)
}
if (ci){
out$ci <- quantile(boot_alpha, c(.025,.975))
}
return(out)
}
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913

Related

How to perform bootstrapping for estimation and inference of quantile regression using multiply imputed data in R?

I am trying to manually pool results from quantile regression models run on multiply imputed data in R using mice. I make use of a bootstrapping procedure to get 95% CIs and P values of the model terms, in which model parameters and their standard errors are obtained after sampling a certain number of rows that is equal to the unique number of participants in my data set. This procedure is repeated 500 times for each of the m imputed data sets. Then, as a last step, I pool the estimated coefficients and their standard errors of the resulting 500 * m regression models according to Rubin's rules (1987) (see e.g. https://bookdown.org/mwheymans/bookmi/rubins-rules.html). To speed things up, I use foreach to split up the analyses over multiple processor cores and for to loop over the m imputed data sets.
However, there seems to be a flaw in the part wherein the results are pooled. When I look at the pooled results, I observe that the P values are not in accordance with the 95% CIs (e.g. P < 0.05 when 0 is included in the 95% CI).
To illustrate this issue, I have made a reproducible example, using these publicly available data: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv
Because there are no missing data in this data set, I introduce them myself and impute the data (m = 10 multiply imputed data sets with 20 iterations). I use set.seed for reproducibility.
Note that I use lm instead of quantreg::rq in this example.
# load data
projdir <- "my_directory"
d <- read.csv(file = file.path(projdir, 'heart_failure_clinical_records_dataset.csv'))
#### introduce missing values
set.seed(1)
# age
age_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$age[age_miss_tag == 1] <- NA # MCAR
# serum creatinine
creat_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$serum_creatinine[creat_miss_tag == 1 & d$anaemia == 0] <- NA # MAR
# CK
CK_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$creatinine_phosphokinase[CK_miss_tag & d$platelets > median(d$platelets)] <- NA # MAR
# platelets
platelets_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$platelets[platelets_miss_tag == 1] <- NA # MCAR
library(mice); library(mitml); library(miceadds); library(splines); library(foreach); library(doParallel)
# impute data
imp <- mice(d, maxit = 20, m = 10, seed = 2)
# log creatinine
implong <- complete(imp, 'long', include = FALSE)
implong$log_creat <- log(implong$serum_creatinine)
imp <- miceadds::datlist2mids(split(implong, implong$.imp))
# compute values for Boundary.knots
temp <- complete(imp, 'long', include = FALSE)
B_knots <- rowMeans(sapply(split(temp, temp$.imp), function(x) {
quantile(x$age, c(0.1, 0.9))
}))
# Convert mids object into a datlist
longlist <- miceadds::mids2datlist(imp)
# fit model based on origial data and use the terms in the below foreach loop
# in order to fix the position of the inner knots
fit_orig <- lm(log_creat ~
# Main effects
ns(age, df = 2, B = c(B_knots[1], B_knots[2])) * sex,
data = longlist[[1]])
To further speed things up, I use OLS instead of quantile regression here and parallelize the process.
# make cluster used in foreach
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, iseed = 9956)
registerDoParallel(cl)
# No. of bootstrap samples to be taken
n_iter <- 500
boot.1 <- c()
for(k in seq_along(longlist)){
boot.1[[k]] <- foreach(i = seq_len(n_iter),
.combine = rbind,
.packages = c('mice', 'mitml', 'splines')) %dopar% {
# store data from which rows can be samples
longlist0 <- longlist[[k]]
# set seed for reproducibility
set.seed(i)
# sample rows
boot_dat <- longlist0[sample(1:nrow(longlist0), replace = TRUE), ]
# linear regression model based on sampled rows
fit1 <- lm(terms(fit_orig), data = boot_dat)
# save coefficients
fit1$coef
}
}
stopCluster(cl)
As a last step, I pool the results according to Rubin's rules.
n_cols <- dim(boot.1[[1]])[2]
list <- c()
for(i in seq_len(n_cols)) {
# extract coefficients
parameter <- lapply(boot.1, function(x){
x[,i]
})
m <- length(parameter)
for(k in seq_len(m)) {
names(parameter[[k]]) <- NULL
}
Q <- sapply(parameter, mean)
U <- sapply(parameter, var) # (standard error of estimate)^2
#### Pooling
# Pooled univariate estimate
qbar <- mean(Q)
# Mean of the variances (i.e. the pooled within-imputation variance)
ubar <- mean(U)
# Between-imputation variance
btw_var <- var(Q)
# Total variance of the pooled estimated
tot_var <- ubar + btw_var + (btw_var / m)
# Relative increase in variance due to non-response
r_var <- (btw_var + (btw_var / m)) / ubar
# Fraction of missing information
lambda <- (btw_var + (btw_var / m)) / tot_var
# degrees of freedom for the t-distribution according to Rubin (1987)
df_old <- (m - 1) / lambda^2
# sample size in the imputed data sets
n_sample <- nrow(longlist[[1]])
# observed degrees of freedom
df_observed <- (((n_sample - n_cols) + 1) / ((n_sample - n_cols) + 3)) *
(n_sample - n_cols) * (1 - lambda)
# adjusted degrees of freedom according to Barnard & Rubin (1999)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
# 95% confidence interval of qbar
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
# F statistic
q <- ((0 - qbar)^2 / tot_var)^2
# Significance level associated with the null value Q[0]
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
list[[i]] <- cbind(qbar, lwr, upr, p_value)
}
names(list) <- colnames(boot.1[[1]])
list
Obviously, the P value shown below is not in accordance with the 95% CI (as 0 is included in the CI, so the P value should be ≥0.05).
> list
$`(Intercept)`
qbar lwr upr p_value
[1,] 0.06984595 -0.02210231 0.1617942 0.008828337
EDIT (29 Dec 2021)
As #Gerko Vink notes in his answer, multiple imputation and bootstrapping both induce variance. The variance induced by imputation is taken care of by Rubin's rules, the bootstrap variance is not. Unfortunately, mice::pool will not work with the output returned by quantreg::rq.
I am aware of constructing bootstrap CIs based on a naive percentile-based approach as shown in this post, but I am inclined to think this is not the correct approach to proceed with.
Does anyone know how to appropriately take care of the extra variance induced by bootstrapping when using rq?
EDIT (30 Dec 2021)
Inspired by this recent post, I decided not to hit the road of bootstrapping anymore, but instead manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules. I have posted this approach as answer below. Any input on how to appropriately take care of the extra variance induced by bootstrapping when using rq is still very welcome though!
Bootstrapping and multiple imputation both induce variance. The imputation variance is taken care of by Rubin's rules for parameters with normal sampling distributions. The bootstrap variance is not.
Two remarks:
First, there is a small error in your code. You're calculating the bootstrap variance about Q in U <- sapply(parameter, var). No need for U <- U/n_iter. U is already the variance and sapply(parameter, sd) would yield the bootstrapped standard error.
Second, you're using bootstrap parameters to calculate a parametric interval and p-value. That seems needlessly complicated and, as you can see, potentially problematic. Why not calculate the bootstrap CI?
See also this link for some inspiration with respect to different means of calculating the CIs and their respective validity.
A small sim that demonstrates that you cannot expect both to be identical for a finite set of bootstrap replications.
library(purrr)
library(magrittr)
#fix seed
set.seed(123)
#some data
n = 1000
d <- rnorm(n, 0, 1)
# ci function
fun <- function(x){
se <- var(x)/length(x)
lwr <- mean(x) - 1.96 * se
upr <- mean(x) + 1.96 * se
ci <- c(lwr, upr)
return(ci)
}
# bootstrap
boot <- replicate(500,
d[sample(1:1000, 1000, replace = TRUE)],
simplify = FALSE)
# bootstrapped ci's based on parameters
boot.param.ci <- boot %>%
map(~.x %>% fun) %>%
do.call("rbind", args = .)
# bootstrap CI
boot.ci <- boot %>%
map(~.x %>% mean) %>%
unlist %>%
quantile(c(.025, .975))
# Overview
data.frame(param = fun(d),
boot.param = boot.param.ci %>% colMeans,
boot.ci = boot.ci)
#> param boot.param boot.ci
#> 2.5% 0.01420029 0.01517527 -0.05035913
#> 97.5% 0.01805545 0.01904181 0.07245449
Created on 2021-12-22 by the reprex package (v2.0.1)
The following reprex also demonstrates that the bootstrap applied to the imputed data yields different variance estimates under the same pooling rules.
library(purrr)
library(magrittr)
library(mice)
#fix seed
set.seed(123)
imp <- mice(boys,
m = 10,
printFlag = FALSE)
imp %>%
complete("all") %>%
map(~.x %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9601179 0.809167659 -2.422388 682.5182 0.01567825 -3.5488747
#> 2 hgt 0.1690468 0.002784939 60.700342 572.1861 0.00000000 0.1635768
#> 3 hc -0.2138941 0.021843724 -9.792018 639.0432 0.00000000 -0.2567883
#> 97.5 %
#> 1 -0.3713610
#> 2 0.1745167
#> 3 -0.1710000
imp %>%
complete("all") %>%
map(~.x %>%
.[sample(1:748, 748, replace = TRUE), ] %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9810146 1.253312293 -1.580623 22.57546 1.278746e-01 -4.5763892
#> 2 hgt 0.1689181 0.004124538 40.954423 24.47123 0.000000e+00 0.1604141
#> 3 hc -0.2133606 0.033793045 -6.313743 22.29686 2.217445e-06 -0.2833890
#> 97.5 %
#> 1 0.6143599
#> 2 0.1774221
#> 3 -0.1433322
Created on 2021-12-22 by the reprex package (v2.0.1)
For quantile regression, mice::pool will not work with the output returned by quantreg::rq, because (according to this post) there is no agreed upon method to calculate standard errors, which are required to pool results under multiple imputation.
An ad hoc solution would be to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules.
First, a reprex using lm to verify whether results from the manual approach and mice::pool are identical.
library(mice)
imp <- mice(nhanes, print = FALSE, seed = 123)
# fit linear model
fit <- with(imp, lm(bmi ~ chl + hyp))
# manually pool univariate estimates using Rubin's rules
pool_manual <- \(model_object) {
m <- length(model_object$analyses)
Q <- sapply(model_object$analyses, \(x) summary(x)$coefficients[, 'Estimate'])
U <- sapply(model_object$analyses, \(x) (summary(x)$coefficients[, 'Std. Error'])^2)
qbar <- rowMeans(Q)
ubar <- rowMeans(U)
btw_var <- apply(Q, 1, var)
tot_var <- ubar + btw_var + (btw_var / m)
lambda <- (btw_var + (btw_var / m)) / tot_var
df_old <- (m - 1) / lambda^2
n_sample <- length(residuals(model_object$analyses[[1]]))
n_cols <- dim(Q)[1]
df_com <- n_sample - n_cols
df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
q <- (0 - qbar)^2 / tot_var
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
rownames(df) <- NULL
names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
return(df)
}
Verify.
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
> extract <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
> summary(pool(fit), conf.int = TRUE)[, extract]
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
Quantile regression
Now, let's pool results from rq for the expected median of the outcome.
library(quantreg)
# fit quantile regression model
fit <- with(imp, rq(bmi ~ chl + hyp, tau = 0.5))
To be able to pool results from rq, only the summary method used to extract point estimates and variances from each of the imputed data sets needs to be adjusted in pool_manual.
Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
Result
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 22.23452856 0.8551626 43.6138945 0.04461337
2 chl 0.03487894 -0.0857199 0.1554778 0.47022312
3 hyp -1.43636147 -6.0666990 3.1939761 0.52455041
> summary(pool(fit), conf.int = TRUE)[, extract]
Error in rq.fit.br(x, y, tau = tau, ci = TRUE, ...) :
unused arguments (effects = "fixed", parametric = TRUE, exponentiate = FALSE)

Maximum Likelihood Estimation - Choosing between nlm and nloptr

I am attempting to find three parameters by minimizing a negative log-likelihood function in R. I have attempted this using two different commands: nlm and nloptr. I get different results for both of these. As such, I was wondering if it is normal for them to differ and if so, which of the commands I should use for this specific question. Another issue I have is that I have to be able to estimate the standard errors of each the parameters and have no idea how to do this.
Here is my code. The wage, unemployment_duration and employed vectors are all vectors of pre-existing data:
####################### Guessing initial values ###################
mu_guess<- 10
sigma_w_guess<- 3.61
lambda_guess<- 5
#Vector of inputs
initial_guess <- c(mu_guess,sigma_w_guess,lambda_guess)
#Lower and upper bounds
lower_bound <- c(0,0,0)
upper_bound <- c(Inf,Inf,Inf)
####################### Specification 1 ###################################
negative_log_likelihood <- function(x){
#Parameters to estimate
mu <- x[1]
sigma_w <- x[2]
lambda <- x[3]
#Rate of leaving unemployment
hu <- lambda*(1 - plnorm(reservation_wage, meanlog = mu,
sdlog = sigma_w, lower.tail = TRUE,
log.p = FALSE))
#Predefining vectors
fu<-matrix(0,1458,1)
fu_hat <- matrix(0,1458,1)
fa<- matrix(0,1458,1)
L <- matrix(0,1458,1)
for (i in 1:1458){
#Probability of a completed spell (vector)
fu[i,1] = hu*exp(-hu*unemployment_duration[i,1])
#Probability of an ongoing spell (vector)
fu_hat[i,1] = exp(-hu*unemployment_duration[i,1])
#Distribution of wage accounting for truncation at reservation wage
fa[i,1] = dlnorm(wage[i,1], meanlog = mu, sdlog = sigma_w,
log = FALSE)/(1 - plnorm(reservation_wage,
meanlog = mu, sdlog = sigma_w,
lower.tail = TRUE,log.p = FALSE))
#log-ikelihood of an observation
if(employed[i,1]==1){L[i,1]=log(fu[i,1]*fa[i,1])}
else{L[i,1]=log(fu_hat[i,1])}
}
#Log-likelihood function
neg_log_lik <- -sum(L)
return(neg_log_lik)
}

Predict segmented lm outside of package

I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o<-## S3 method for class 'lm':
segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base r predict() function (not the segmented.predict() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the segmented.predict() function, I produced the design matrix and use that to predict values instead of directly using the predict() function. I do not consider this a full answer of the original question yet because predict() can also produce confidence intervals for the prediction, and I have not yet implemented that--question still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: the segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer

Multivariate ttest using r and winbug

How can I do difference in means (ttest) for a multivariate using R and WinBUGS14
I have a multivariate outcome y and the categorical variable x. I am able to get the means of the MCMC sampled values from the multivariate using the code below, but how can I test for the difference in means by variable x?
Here is the R code
library(R2WinBUGS)
library(MASS) # need to mvrnorm
library(MCMCpack) # need for rwish
# Generate synthetic data
N <- 500
#we use this to simulate the data
S <- matrix(c(1,.2,.2,5),nrow=2)
#Produces one or more samples from the specified multivariate normal distribution.
#produces 2 variables with the given distribution
y <- mvrnorm(n=N,mu=c(1,3),Sigma=S)
x <- rbinom(500, 1, 0.5)
# Set up for WinBUGS
#set up of the mu0 values
mu0 <- as.vector(c(0,0))
#covariance matrices
# the precisions
S2 <- matrix(c(1,0,0,1),nrow=2)/1000 #precision for unkown mu
# precison matrix to be passes to the wishart distribution for the tau
S3 <- matrix(c(1,0,0,1),nrow=2)/10000
#the data for the winbug code
data <- list("y","N","S2","S3","mu0")
inits <- function(){
list( mu=mvrnorm(1,mu0,matrix(c(10,0,0,10),nrow=2) ),
tau <- rwish(3,matrix(c(.02,0,0,.04),nrow=2)) )
}
# Run WinBUGS
bug_file <- paste0(getwd(), "/codes/mult_normal.bug")
multi_norm.sim <- bugs(data,inits,model.file=bug_file,
parameters=c("mu","tau"),n.chains = 2,n.iter=4010,n.burnin=10,n.thin=1,
bugs.directory="../WinBUGS14/",codaPkg=F)
print(multi_norm.sim,digits=3)
and this is the WinBUGS14 code called mult_normal.bug
model{
for(i in 1:N)
{
y[i,1:2] ~ dmnorm(mu[],tau[,])
}
mu[1:2] ~ dmnorm(mu0[],S2[,])
#parameters of a wishart
tau[1:2,1:2] ~ dwish(S3[,],3)
}
2 Steps:
Load a function to run the t.test using sample statistics instead of doing it directly.
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
if( equal.variance==FALSE )
{
se <- sqrt( (s1^2/n1) + (s2^2/n2) )
# welch-satterthwaite df
df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
} else
{
# pooled standard deviation, scaled by the sample sizes
se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) )
df <- n1+n2-2
}
t <- (m1-m2-m0)/se
dat <- c(m1-m2, se, t, 2*pt(-abs(t),df))
names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
return(dat)
}
Parse out the mean and standard deviation of the things we want to test against x, then pass them to the function.
mu1 <- as.data.frame(multi_norm.sim$mean)$mu[1]
sdmu1 <- multi_norm.sim$sd$mu[1]
t.test2( mean(x), as.numeric(mu1), s1 = sd(x), s2 = sdmu1, 500, 500)
Difference of means Std Error t p-value
-4.950656e-01 2.246905e-02 -2.203323e+01 5.862968e-76
When I copied the results from my screen to SO it was hard to make the labels of the results properly spaced apart, my apologies.

Maximum likelihood estimation of the log-normal distribution using R

I'm trying to estimate a linear model with a log-normal distributed error term. I already have working code for a linear model with normally distributed errors:
library(Ecdat)
library(assertthat)
library(maxLik)
# Load the data
data(Wages1)
# Check what R says
summary(lm(wage ~ school + exper + sex, data = Wages1))
# Use maxLik from package maxLik
# The likelihood function
my_log_lik_pos <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(my_data), 1))
return(sum(log(dnorm(y, mean = xb, sd = sigma))))
}
# Bind the data
my_data <- cbind(Wages1$wage, 1, Wages1$school, Wages1$exper, Wages1$sex)
my_problem <- maxLik(my_log_lik_pos, data = my_data,
start = rep(1,5), method = "BFGS")
summary(my_problem)
I get approximately the same results. Now I try to do the same, but using the log-normal likelihood. For this, I have to first simulate some data:
true_beta <- c(0.1, 0.2, 0.3, 0.4, 0.5)
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
my_data_2 <- cbind(ys, my_data[, -1])
And the log-likelihood function:
my_log_lik_lognorm <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(data), 1))
return(sum(log(dlnorm(y, mean = xb, sd = sigma))))
}
my_problem2 <- maxLik(my_log_lik_lognorm, data = my_data_2,
start = rep(0.2,5), method = "BFGS")
summary(my_problem2)
The estimated parameters should be around the values of true_beta, but for some reason I find completely different values. I tried with different methods, different starting values but to no avail. I'm sure that I'm missing something obvious, but I don't see what.
Am I right to assume that the log-likelihood of the log-normal distribution is:
sum(log(dlnorm(y, mean = .., sd = ...))
Unless I'm mistaken, this is the definition of the log-likelihood (sum of the logs of the densities).
I found the issue: it seems the problem is not my log-likelihood function. When I try to estimate the model with glm:
summary(glm(ys ~ school + exper + sex, family=gaussian(link="log"), data=Wages1))
I get the same result as with maxLik and my log-likelihood. It would seem the problem comes from when I tried to simulate some data:
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
The correct way to simulate the data:
ys <- rlnorm(nrow(my_data), my_data[, -1] %*% head(true_beta, -1), tail(true_beta, 1))
Now everything works!

Resources