According to the documentation of the mice package, if we want to impute data when we're interested in interaction terms, we need to use passive imputation, which is done as follows:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I((bmi-25)*(chl-200))"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
It is said that
Imputations created in this way preserve the interaction of bmi with chl
Here, a new variable called bmi.chl is created in the original dataset. The meth step tells mice how this variable needs to be imputed from the existing ones. The pred step says we don't want to predict bmi and chl from bmi.chl. But now, if we want to fit a model, how do we proceed? Is the product defined by "~I((bmi-25)*(chl-200))" just a way to control for the imputed values of the main effects, i.e. bmi and chl?
If the model we want to fit is glm(hyp~chl*bmi, family="binomial"), what is the correct way to specify this model from the imputed data? fit1 or fit2?
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
Or do we somehow have to use the imputed values of the newly created variable, i.e. bmi.chl?
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
With passive imputation, it does not matter if you use the passively imputed variable, or if you re-calculate the product term in your call to glm.
The reason that fit1 and fit2 yield different results in your example is that you are not just doing passive imputation for the product term.
Instead, you are transforming the two variables before multiplying them (i.e., you calculate bmi-25 and chl-200). As a result, the passively imputed variable bmi.chl does not represent the product term bmi*chl but rather (bmi-25)*(chl-200).
If you just calculate the product term, then fit1 and fit2 yield the same results, as they should:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)       # placeholder column for the product term
ini <- mice(nhanes2.ext, max = 0, print = FALSE)  # dry run to get the default meth and pred
meth <- ini$meth
meth["bmi.chl"] <- "~I(bmi*chl)"                  # passive imputation: plain product, no centering
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0             # don't impute bmi or chl from their own product
pred[c("hyp"), "bmi.chl"] <- 1                    # but do use the product when imputing hyp (see remark below)
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
# > round(summary(pool(fit1)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# chl:bmi 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 NA 0.43 0.33
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
# > round(summary(pool(fit2)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# bmi.chl 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 25 0.43 0.33
This is not surprising because the ~I(bmi*chl) in mice and the bmi*chl in glm do the exact same thing. They merely calculate the product of the two variables.
Remark:
Note that I added a line saying that bmi.chl should be used as a predictor when imputing hyp. Without this step, passive imputation has no purpose because the imputation model would neglect the product term, thus being incongruent with the analysis model.
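As a quick sanity check of the setup above (a sketch using the pred object from the code; in mice's predictorMatrix, rows are the variables being imputed and columns are their predictors):
pred["hyp", "bmi.chl"]            # should be 1: the product term enters the imputation model for hyp
pred[c("bmi", "chl"), "bmi.chl"]  # should be 0: bmi and chl are not imputed from their own product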
When running my script I get the following error message: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases. I'm guessing that is due to some negative values?
The script loops through a list of csv files, and for a small selection of them the code works, but when I run it over all of them I get the error message. I checked the data: there are some negative NDVI values (about 2% of the whole data), which are always -99999, and there are some soil moisture values that are 0.
I found the suggestion to add na.action=na.exclude to the lm function:
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)                          # original call
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf, na.action = na.exclude)  # with na.exclude
But the same error still occurs. Do you have any other solution for this, besides deleting the negative values from the data? Ideally I would like to either exclude these values from the linear regression (lm), or skip the whole csv file if it contains negative values.
Missing values in R should be coded as NA. You could use replace():
replace(dat, dat == -99999, NA)
# X1 X2 X3
# 1 1.37 1.30 -0.31
# 2 NA 2.29 -1.78
# 3 0.36 -1.39 -0.17
# 4 0.63 -0.28 1.21
# 5 0.40 NA 1.90
# 6 -0.11 0.64 -0.43
# 7 1.51 -0.28 -0.26
# 8 -0.09 -2.66 -1.76
# 9 2.02 -2.44 NA
# 10 -0.06 1.32 -0.64
which actually works directly inside the lm() call, without changing the data:
lm(X1 ~ X2 + X3, replace(dat, dat == -99999, NA))$coefficients
# (Intercept) X2 X3
# 0.61499466 0.06062925 0.25979370
If there is more than one missing-value code, you could do e.g.:
replace(dat, array(unlist(dat) %in% c(-99999, -88888), dim(dat)), NA)
Data:
set.seed(42)
dat <- data.frame(matrix(round(rnorm(30), 2), 10, 3))
dat[2, 1] <- -99999
dat[5, 2] <- -99999
dat[9, 3] <- -99999
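Since the script loops over csv files, another option is to declare the missing-value code at read time. A minimal sketch, assuming the files are read with read.csv (the file name here is hypothetical); rows containing NA are then dropped by lm's default na.action (na.omit):
BeforeConf <- read.csv("ndvi_site_01.csv", na.strings = c("NA", "-99999"))  # -99999 becomes NA on import
model <- lm(NDVI ~ T + Prec + soilM, data = BeforeConf)                     # NA rows are omitted by default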
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
  bestglm(Xy = winedata,
          family = binomial,      # binomial family for logistic regression
          IC = "AIC",             # information criterion
          method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error:
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual says too. If that's the case, then why can't I find the test error for it using 10-fold CV with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.
The data was processed as -
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
The bestglm fit rearranges your data and names your response variable y. Hence, if you pass it back into cv.glm, winedata does not have a column y, and everything crashes after that.
It's always good to check what the class is:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You can substitute things in the call etc., but it's too much of a mess. Fitting is not costly, so just refit the selected model on winedata and pass that to cv.glm:
# pick out the predictors included in the best model (first row of BestModels)
best_var <- apply(res.best.logistic$BestModels[, -ncol(winedata)], 1, which)
# take the variable names for the best model
best_var <- names(best_var[[1]])
# refit that model on the original data and cross-validate it
new_form <- as.formula(paste("good ~", paste(best_var, collapse = "+")))
fit <- glm(new_form, winedata, family = "binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
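An alternative sketch with the same result (assuming the res.best.logistic object from above): rebuild the formula directly from the terms of the selected model and refit it on winedata before cross-validating:
best_terms  <- attr(terms(res.best.logistic$BestModel), "term.labels")  # predictors of the chosen model
new_form    <- reformulate(best_terms, response = "good")
fit         <- glm(new_form, data = winedata, family = binomial)
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
best.cv.err$delta   # raw and adjusted cross-validated estimate of the cost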
I have questions about multivariable Cox regression analysis including non-binary categorical variables.
My data consist of several variables; some of them are binary (like sex, and age over 70, etc.),
whereas the rest of them are not (for example, ECOG).
I tried both the analyse_multivariate function and the coxph function, but it seems that I can only get an overall hazard ratio per variable, as if the variables were not categorical. I'd like to know both the overall hazard ratio for the variable and the individual hazard ratios for the subcategories within the variable (like hazard ratios for ECOG 0, ECOG 1, ECOG 2, and for ECOG overall).
What I tried in the process is like this:
(1)
ECOG = as.factor(df$ECOG)
analyse_multivariate(data = df,
                     time_status = vars(df$OS, df$survival_status==1),
                     covariates = vars(df$age70, df$sex, ECOG),
                     reference_level_dict = c(ECOG==0))
and the result is like this:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 1.07 0.82 1.41 0.93 0.71 1.22
ECOG:4 ECOG 4 1.13 0.16 8.19 0.89 0.12 6.43
df$sex df$sex <continuous> 1.87 0.96 3.66 0.53 0.27 1.04
ECOG:1 ECOG 1 2.14 1.63 2.81 0.47 0.36 0.61
ECOG:3 ECOG 3 12.12 7.83 18.76 0.08 0.05 0.13
ECOG:2 ECOG 2 13.72 4.92 38.26 0.07 0.03 0.2
(2)
analyse_multivariate(data = df,
                     time_status = vars(df$OS, df$survival_status==1),
                     covariates = vars(df$age70, df$sex, df$ECOG),
                     reference_level_dict = c(ECOG==0))
and the result is:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 0.89 0.68 1.16 1.13 0.86 1.47
df$sex df$sex <continuous> 1.87 0.96 3.65 0.53 0.27 1.04
df$ECOG df$ECOG <continuous> 1.9 1.69 2.15 0.53 0.47 0.59
Does it make sense to take the overall p-value for ECOG from (2), treat ECOG as a significant variable if that p-value is < 0.05, and then combine it with the individual hazard ratios for the individual ECOG levels from (1)?
For example, to generate a table like the following:
p-value 0.01
ECOG 1 Reference
ECOG 2 13.72 (4.92-38.26)
ECOG 3 12.12 (7.83-18.76)
ECOG 4 1.13 (0.16-8.19)
I believe there are better solutions but couldn't find one.
Any comments would be appreciated!
Thank you in advance.
Short answer is no. In (2), ECOG is treated as a continuous covariate, meaning you expect the log hazard to have a linear relationship with ECOG, whereas in (1) you expect every level (1 to 4) to have a different effect on survival. To test the variable ECOG collectively, you can do an anova:
library(survivalAnalysis)
data = survival::lung
data$ECOG = factor(data$ph.ecog)
data$sex = factor(data$sex)
fit1 = data %>%
  analyse_multivariate(vars(time, status),
                       covariates = vars(age, sex, ECOG, wt.loss))
anova(fit1$coxph)
Analysis of Deviance Table
Cox model: response is Surv(time, status)
Terms added sequentially (first to last)
loglik Chisq Df Pr(>|Chi|)
NULL -675.02
age -672.36 5.3325 1 0.020931 *
sex -667.82 9.0851 1 0.002577 **
ECOG -660.26 15.1127 3 0.001723 **
wt.loss -659.31 1.9036 1 0.167680
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
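If you want both pieces from a single fit, here is a minimal sketch with plain survival::coxph on the same lung-based example (the object name cox_fit is mine): summary() reports the per-level hazard ratios against the reference level, and anova() reports the overall multi-degree-of-freedom test for ECOG:
library(survival)
cox_fit <- coxph(Surv(time, status) ~ age + sex + ECOG + wt.loss, data = data)
summary(cox_fit)   # exp(coef): HRs for ECOG 1, 2, 3 relative to the reference level ECOG 0
anova(cox_fit)     # sequential likelihood-ratio tests, including a 3-df test for ECOG as a whole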
JAGS
I have an intercept-only logistic model in JAGS, defined as follows:
model{
  for(i in 1:Ny){
    y[i] ~ dbern(mu[s[i]])
  }
  for(j in 1:Ns){
    mu[j] <- ilogit(b0[j])
    b0[j] ~ dnorm(0, sigma)
  }
  sigma ~ dunif(0, 100)
}
When I plot the posterior distribution of b0 collapsing across all subjects (i.e., all b0[j]), my 95% HDI includes 0: -0.55 to 2.13. The Effective Sample Size is way above 10,000 for every b0 (around 18,000 on average). Diagnostics look good.
glmer()
Now, this is the equivalent glmer() model:
glmer(response ~ 1 + (1|subject), data = myData, family = "binomial")
The result of this model, however, is as follows:
Random effects:
Groups Name Variance Std.Dev.
speaker (Intercept) 0.3317 0.576
Number of obs: 1544, groups: subject, 27
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7401 0.1247 5.935 2.94e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So here it says my estimate is significantly above 0.
What the data look like
Here are the proportions of 0s and 1s by subject (see the table below). You can see that, for the vast majority of subjects, the proportion of 1 is above 50%.
Any ideas why JAGS and glmer() are so different here?
0 1
1 0.47 0.53
2 0.36 0.64
3 0.29 0.71
4 0.42 0.58
5 0.12 0.88
6 0.22 0.78
7 0.54 0.46
8 0.39 0.61
9 0.30 0.70
10 0.32 0.68
11 0.36 0.64
12 0.66 0.34
13 0.38 0.62
14 0.49 0.51
15 0.35 0.65
16 0.32 0.68
17 0.12 0.88
18 0.45 0.55
19 0.36 0.64
20 0.36 0.64
21 0.28 0.72
22 0.40 0.60
23 0.41 0.59
24 0.19 0.81
25 0.27 0.73
26 0.08 0.92
27 0.12 0.88
You forgot to include a mean value, so your intercept parameter is fixed to zero. Something like this should work:
model{
  for(i in 1:Ny){
    y[i] ~ dbern(mu[s[i]])
  }
  for(j in 1:Ns){
    mu[j] <- ilogit(b0[j])
    b0[j] ~ dnorm(mu0, sigma)   # subject intercepts now vary around a common mean mu0
  }
  mu0 ~ dnorm(0, 0.001)         # vague prior on the overall intercept (JAGS dnorm takes mean and precision)
  sigma ~ dunif(0, 100)
}
Now the posterior density of mu0 should match the sampling distribution of the intercept parameter from glmer reasonably well.
Alternatively, if you use response ~ -1 + (1|subject) as your glmer formula, you should get results that match your current JAGS model.
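A small sketch of how to compare the two once mu0 is added (object names here are hypothetical, using standard lme4 accessors): the fixed-effect intercept and its profile interval from glmer should then line up with the posterior of mu0:
library(lme4)
m <- glmer(response ~ 1 + (1 | subject), data = myData, family = binomial)
fixef(m)["(Intercept)"]      # point estimate to set against the posterior mean of mu0
confint(m, parm = "beta_")   # profile CI for the fixed effect, to set against the 95% HDI of mu0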
I am doing a factor analysis via the psych package, which generates quite a lot of output: fa8 <- fa(corMat, nfactors=8, ...)
The output includes some matrices and some text information. However, I did not find a good way of saving the matrices from the output to a file. So far, I was able to dump the complete output via sink("foo.txt"); fa8; sink(). Neither write(fa8) nor write.csv(fa8) works, because of the class of the output - what gets written does not contain the matrix data itself.
Any suggestions on how I can get the fa-matrix itself for further analysis and saving it to a file?
update #1:
An example output of fa(corMat, nfactors=2, ...) would be:
Factor Analysis using method = pa
Call: fa(r = corMat, nfactors = 2, rotate = "oblimin", fm = "pa")
Standardized loadings based upon correlation matrix
PA1 PA2 h2 u2
BIO 0.86 0.02 0.75 0.255
GEO 0.78 0.05 0.63 0.369
CHEM 0.87 -0.05 0.75 0.253
ALG -0.04 0.81 0.65 0.354
CALC 0.01 0.96 0.92 0.081
STAT 0.13 0.50 0.29 0.709
PA1 PA2
SS loadings 2.14 1.84
Proportion Var 0.36 0.31
Cumulative Var 0.36 0.66
With factor correlations of
PA1 PA2
PA1 1.00 0.21
PA2 0.21 1.00
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 15 and the objective function was 2.87
The degrees of freedom for the model are 4 and the objective function was 0.01
The root mean square of the residuals is 0.01
The df corrected root mean square of the residuals is 0.02
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2
Correlation of scores with factors 0.94 0.96
Multiple R square of scores with factors 0.88 0.93
Minimum correlation of possible factor scores 0.77 0.86
Source: http://rtutorialseries.blogspot.de/2011/10/r-tutorial-series-exploratory-factor.html
The question is: How do I get the standardized loadings matrix in the output for further analysis?
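One possible approach (a sketch, assuming the fitted object fa8 from above): psych::fa stores the standardized loadings in the loadings element, which prints specially but is an ordinary numeric matrix underneath, so it can be unclassed and written to a file:
L <- unclass(fa8$loadings)                 # plain numeric matrix of standardized loadings
write.csv(L, file = "fa8_loadings.csv")    # hypothetical file name
round(L, 2)                                # or keep it in R for further analysis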