I have some survey data with sample weights, and I'm using the survey package in R to compare means between demographic groups. I've had no problems using svyttest for two-sample t-tests involving dichotomous independent variables (e.g., sex).
However, I'm having some issues with the anova.svyglm function in the survey package. When the generalized linear model has two independent variables, it works fine. But when the GLM has only one polytomous independent variable, I get a NULL result.
Here's some example code comparing BMI across demographic groups to illustrate my problem:
library(survey)
# Survey design
healthdsgn = svydesign(data = healthsvy, id = ~SVYPSU, strata = ~SVYSTRATA, weights = ~SVYWT, nest = TRUE)
# This has two independent variables
# Outputs a normal ANOVA table
race_sex_glm = svyglm(BMI~RACE+SEX, healthdsgn)
anova(race_sex_glm)
# This has one independent polytomous variable
# Does not output a normal ANOVA table. Returns NULL
race_glm = svyglm(BMI~RACE, healthdsgn)
anova(race_glm)
This problem doesn't occur when I use the aov function in R — both of these produce normal ANOVA output:
summary(aov(BMI~RACE+SEX, healthsvy))
summary(aov(BMI~RACE, healthsvy))
However, aov doesn't account for survey weights so the output would be incorrect.
What should I do to resolve this?
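One possible workaround, offered as an assumption rather than something taken from the post above: test the single term directly with regTermTest from the survey package, which runs a design-based Wald test on all coefficients of a term at once. The sketch uses the api data shipped with survey in place of healthsvy (which is not available here), and the three-level factor stype stands in for RACE.

```r
library(survey)

# Example data shipped with the survey package (stand-in for healthsvy)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# One polytomous predictor, analogous to BMI ~ RACE
fit <- svyglm(api00 ~ stype, design = dclus1)

# Joint Wald test of all stype coefficients
res <- regTermTest(fit, ~stype)
print(res)
```

With the design from the question, the analogous call would be regTermTest(race_glm, ~RACE).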
Related
I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a made-up example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions are skipped for the sake of the example. The "test" variable is the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation has this variable set to 3. Normally, I would just pass my model as the first argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object is of class mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that gets me closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test~pregnant+glucose+diastolic+
               triceps+age+bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant=3))
  return(out)
}
# seed is the mice argument that makes the imputations reproducible
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
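For comparison, here is a minimal sketch of the same idea (fit on each imputed dataset, predict, then average) using only mice and base R. It relies on assumptions not in the original posts: the nhanes data shipped with mice stands in for the amputed pima data, hyp01 is a hypothetical 0/1 recoding of hyp, and bmi = 25 is an arbitrary fixed value playing the role of pregnant = 3. Only the point estimate is pooled here; a proper variance would need Rubin's rules applied to the per-imputation variances as well.

```r
library(mice)

# Small example data with missing values shipped with mice
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp, "all")   # list of 5 completed data frames

# For each completed dataset: fit the model, fix bmi at 25 for every row,
# and average the predicted probabilities (marginal standardization)
pred_by_imp <- sapply(completed, function(d) {
  d$hyp01 <- d$hyp - 1              # hyp is coded 1/2; recode to 0/1
  fit <- glm(hyp01 ~ age + bmi, data = d, family = binomial)
  d$bmi <- 25
  mean(predict(fit, newdata = d, type = "response"))
})

# Pooled point estimate: the mean across imputations
pooled <- mean(pred_by_imp)
pooled
```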
I would like to perform post-hoc tests on imputed data using MICE in R.
Typically, mice imputes the data, which are then converted to long format to calculate total scores and converted back into a mids object. The analysis is then run on each imputed dataset, producing a mira object, after which the results can be pooled.
However, I am not able to get this working for post hoc tests, including Tukey and Games-Howell. Would someone be able to help?
IMP <- mice(data, m=5, maxit=10)
IMP_long <- data.frame(complete(IMP, include=TRUE, action= 'long'))
IMP_mids <- as.mids(IMP_long, where = NULL, .imp='.imp', .id = '.id')
fit <- with(IMP_mids, expr=lm(total_score ~ GroupingVariable))
The grouping variable consists of 3 groups which I would like to compare pairwise. Namely 1vs2, 2vs3 and 1vs3.
summary(pool(fit))
-> this only gives comparisons of two groups against the reference group (the intercept). The same happens if I set contrasts before creating the model.
Does anyone know how to compare the three groups in one analysis with Tukey and/or Games-Howell?
Thanks in advance!!
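A hedged sketch of one route to the Tukey-adjusted comparisons: the emmeans package has support for mira objects, so the fitted object returned by with() can be passed to emmeans directly. The assumptions here are that the nhanes2 data from mice stands in for the poster's data, with the three-level factor age playing the role of GroupingVariable and bmi the role of total_score. Games-Howell is not covered by this sketch.

```r
library(mice)
library(emmeans)

# nhanes2 ships with mice; age is a factor with three levels
imp <- mice(nhanes2, m = 5, seed = 1, printFlag = FALSE)

# Fit the same linear model on every imputed dataset (a mira object)
fit <- with(imp, lm(bmi ~ age))

# Estimated marginal means per group, based on the pooled results
emm <- emmeans(fit, ~ age)

# All pairwise comparisons (1 vs 2, 1 vs 3, 2 vs 3) with Tukey adjustment
tukey <- pairs(emm, adjust = "tukey")
tukey
```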
I have a dataset with 283 observations of 60 variables. My outcome variable (Diagnosis) is dichotomous and can be either of two diseases. I am comparing two diseases that often show much overlap, and I am trying to find the features that can help differentiate them from each other. My understanding is that LASSO logistic regression is well suited to this problem; however, it cannot be run on an incomplete dataset.
So I imputed my missing data with the mice package in R and found that approximately 40 imputations is appropriate for the amount of missing data that I have.
Now I want to perform lasso logistic regression on all 40 imputed datasets, and I am stuck at the part where I need to pool the results across these 40 datasets.
The with() function from mice does not work with glmnet.
# Impute database with missing values using MICE package:
imp<-mice(WMT1, m = 40)
#Fit regular logistic regression on imputed data
imp.fit <- glm.mids(Diagnosis~., data=imp,
family = binomial)
# Pool the results of all the 40 imputed datasets:
summary(pool(imp.fit),2)
The above seems to work fine with logistic regression using glm(), but when I try the same approach with lasso regression, it fails:
# First perform cross validation to find optimal lambda value:
CV <- cv.glmnet(Diagnosis~., data = imp,
family = "binomial", alpha = 1, nlambda = 100)
When I try to perform cross-validation I get this error message:
Error in as.data.frame.default(data) :
cannot coerce class ‘"mids"’ to a data.frame
Can somebody help me with this problem?
A thought:
Consider running the analyses on each of the 40 datasets.
Then, storing which variables are selected in each in a matrix.
Then, setting some threshold (e.g., selected in >50% of datasets).
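The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (y, x1 to x6 are hypothetical names standing in for Diagnosis and the 60 predictors), m = 5 is used instead of 40 to keep it quick, and the lasso is fit through glmnet's matrix interface because cv.glmnet expects a predictor matrix and response vector rather than a formula.

```r
library(mice)
library(glmnet)

set.seed(42)
n <- 200
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - 1.5 * X[, 2]))
dat <- data.frame(y = y, X)
dat$x1[sample(n, 20)] <- NA          # introduce some missing values
dat$x3[sample(n, 20)] <- NA

imp <- mice(dat, m = 5, seed = 42, printFlag = FALSE)
completed <- complete(imp, "all")    # list of completed data frames

# Step 1 and 2: lasso on each dataset, record which coefficients are nonzero
selected <- sapply(completed, function(d) {
  x <- as.matrix(d[, setdiff(names(d), "y")])
  cv <- cv.glmnet(x, d$y, family = "binomial", alpha = 1)
  beta <- as.matrix(coef(cv, s = "lambda.min"))[-1, 1]  # drop intercept
  beta != 0
})

# Step 3: keep variables selected in more than 50% of the datasets
selection_freq <- rowMeans(selected)
names(selection_freq)[selection_freq > 0.5]
```

The 50% threshold is the one suggested above; it is a judgment call, not a rule.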
Is there a way to prevent predictor variables that are correlated above a certain r cutoff from being included together in the candidate glm models using glmulti? Maybe an argument or implementation that uses a correlation matrix (or a TRUE/FALSE matrix) to subset the variable combinations? The code below just illustrates how I would normally initiate glmulti and doesn't attempt to subset variables:
global.model<-glm(y ~ x1+x2...+x28, data=all.data, family=binomial)
res <- glmulti(global.model, level=2, crit="aicc")
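I am not aware of a built-in glmulti option that does this, so what follows is only a base-R sketch of the filtering logic, not a glmulti feature. It flags predictor pairs whose absolute correlation exceeds a cutoff and drops every candidate variable set that contains both members of a flagged pair. The data, the 0.7 cutoff, and the helper has_bad_pair are all hypothetical.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # strongly correlated with x1
d$x3 <- rnorm(100)

cutoff <- 0.7
cm <- abs(cor(d))

# Pairs above the cutoff (upper triangle only, so each pair appears once)
high <- which(cm > cutoff & upper.tri(cm), arr.ind = TRUE)
bad_pairs <- data.frame(a = rownames(cm)[high[, "row"]],
                        b = colnames(cm)[high[, "col"]])

# TRUE if a variable set contains both members of any flagged pair
has_bad_pair <- function(vars) {
  any(bad_pairs$a %in% vars & bad_pairs$b %in% vars)
}

# Enumerate all non-empty variable subsets and keep the admissible ones
vars <- colnames(cm)
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
keep <- Filter(function(s) !has_bad_pair(s), subsets)
keep
```

Each retained subset could then be turned into a formula and fitted with glm, and the fits compared by AICc.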
I have a data set with both continuous and categorical variables. In the end I want to build a logistic regression model to calculate the probability of a response dichotomous variable.
Is it acceptable, or even a good idea, to apply a log linear model to the categorical variables in the model to test their interactions, and then use the indicated interactions as predictors in the logistic model?
Example in R:
Columns in df: CategoricalA, CategoricalB, CategoricalC, CategoricalD, CategoricalE, ContinuousA, ContinuousB, ResponseA
library(MASS)
#Isolate categorical variables in new data frame
catdf <- df[,c("CategoricalA","CategoricalB","CategoricalC", "CategoricalD", "CategoricalE")]
#Create cross table
crosstable <- table(catdf)
#build log-lin model
model <- loglm(formula = ~ CategoricalA * CategoricalB * CategoricalC * CategoricalD * CategoricalE, data = crosstable)
#Use step() to build better model
automodel <- step(object = model, direction = "backward")
Then build a logistic regression using the output of automodel and the values of ContinuousA and ContinuousB in order to predict ResponseA (which is binary).
My hunch is that this is not OK, but I can't find a definitive answer one way or the other.
Short answer: Yes. You can use any information in the model that will be available in out-of-time or 'production' run of the model. Whether this information is good, powerful, significant, etc. is a different question.
The logic is that a model can have any type of RHS variable, be it categorical, continuous, logical, etc. Furthermore, you can combine RHS variables to create a single RHS variable and also apply transformations. The log-linear model of the categorical variables is nothing but a transformed linear combination of raw variables (that happen to be categorical). This method would not violate any particular modeling framework.
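As a minimal sketch of what this looks like in practice, suppose the log-linear step pointed to a CategoricalA by CategoricalB interaction. That interaction can then be entered in the logistic model alongside a continuous predictor. The data below are simulated, and the factor levels (a1, a2, b1, b2) and effect sizes are assumptions made purely for illustration.

```r
set.seed(7)
n <- 300
df <- data.frame(
  CategoricalA = factor(sample(c("a1", "a2"), n, replace = TRUE)),
  CategoricalB = factor(sample(c("b1", "b2"), n, replace = TRUE)),
  ContinuousA  = rnorm(n)
)

# Simulate a binary response that depends on the A x B interaction
is_a2 <- df$CategoricalA == "a2"
is_b2 <- df$CategoricalB == "b2"
lp <- -0.5 + 0.8 * is_a2 + 0.5 * is_b2 + 0.7 * (is_a2 & is_b2) +
  0.6 * df$ContinuousA
df$ResponseA <- rbinom(n, 1, plogis(lp))

# Logistic regression with the interaction suggested by the log-linear step
fit <- glm(ResponseA ~ CategoricalA * CategoricalB + ContinuousA,
           data = df, family = binomial)
summary(fit)$coefficients
```

Whether the interaction earns its place is still judged inside the logistic model itself, for example by comparing fits with and without it.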