Multinomial/ordinal hypothesis tests in SAS/R

I'm trying to recreate some SAS output in R. I'm doing ordinal/multinomial regression using the polr and multinom functions from the MASS and nnet packages, respectively.
The SAS output I want to recreate in R is the test of the global null hypothesis via the LRT, score, and Wald tests, as well as the Type 3 analysis of effects, i.e. essentially a joint test of all the interaction terms together and a test of each main effect. I tried the wald.test function from the aod package, but it kept raising errors about L and V not being conformable arrays, even though I made sure L was a matrix of the same size as the coefficient matrix passed to the b = argument.
Lastly, is there a quick way to test the proportional odds assumption in R?
Any help/guidance is appreciated. Thanks!
Some example data:
library(forcats)  # gss_cat example data
library(nnet)     # multinom()
educ <- runif(21483, min = 0, max = 20)  # 21483 = nrow(gss_cat)
df <- cbind(gss_cat[, c("marital", "race")], educ)
model <- multinom(marital ~ race * educ, data = df)
Basically, what I'm trying to reproduce from SAS is the following PROC LOGISTIC call:
proc logistic data=in desc;
  class race / param=ref;
  model marital = educ race educ*race / link=glogit;
  output out=predicted predprobs=individual;
run;
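For reference, here is a sketch of how I would approximate those SAS tests in R. This is my own attempt, not verified against the SAS output; it assumes the model and df objects defined above, and car::Anova, aod::wald.test, and the brant package are my suggestions:
library(car)  # Anova() for Type-3-style tests
library(aod)  # wald.test()
# Global null via LRT: compare against an intercept-only model
null_model <- multinom(marital ~ 1, data = df)
anova(null_model, model)
# Type-3-style analysis of effects: LR chi-squared test of each term,
# with all interaction terms tested jointly
Anova(model, type = 3)
# For wald.test, b must be a coefficient *vector*, not the matrix that
# coef() returns for multinom fits; flattening row-wise should match the
# ordering of vcov(model), but check that the names line up
b <- as.vector(t(coef(model)))
V <- vcov(model)
wald.test(Sigma = V, b = b, Terms = grep("race.+:educ", colnames(V)))
# Proportional odds: the Brant test on a polr fit (polr needs an ordered
# response; marital isn't truly ordinal, so this only shows the mechanics)
library(brant)
polr_model <- polr(as.ordered(marital) ~ race + educ, data = df)
brant(polr_model)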

Related

Most important variables for finding group membership

I have a dataset of 8,100 observations on 118 variables that are used to determine which one of 4 groups each respondent falls into. I am interested in which variables are the most important for predicting group membership. My data is a combination of ordinal and binary variables.

I initially did a discriminant function analysis, but then read that this does not handle binary data well. Next I tried a multinomial logistic regression, but from there I am struggling to work out which variables are the most important. I also tried an rpart decision tree, but then read that these are not very stable, and indeed, when I ran it on a random half of my data I got different results every time. Now I am trying a dominance analysis. I can get it working for a linear model (lm), but for both the multinomial logistic regression and the discriminant function analysis I get the error:
Error in daRawResults(x = x, constants = constants, terms = terms, fit.functions = fit.functions, :
Not implemented method to retrieve data from model
Does anyone have any advice for what else I can try? Only 4 of the 118 variables are binary, so I can remove them if needed and will still have a good analysis.
Here is a reproducible example including a much smaller example dataset:
set.seed(1)  ## for reproducibility
remotes::install_github("clbustos/dominanceAnalysis")  # if you don't have the dominance analysis package
library(dominanceanalysis)
library(MASS)
library(nnet)
mydata <- data.frame(Segments = sample(1:4, 15, replace = TRUE),
                     var1 = sample(1:7, 15, replace = TRUE),
                     var2 = sample(1:7, 15, replace = TRUE),
                     var3 = sample(1:6, 15, replace = TRUE),
                     var4 = sample(1:2, 15, replace = TRUE))
# Show that it works for a linear model
LM <- lm(Segments ~ ., data = mydata)
da.LM <- dominanceAnalysis(LM); da.LM
# var1 is the most important, followed by var4
# Try the discriminant function analysis
DFA <- lda(Segments ~ ., data = mydata)
da.DFA <- dominanceAnalysis(DFA)
# Error
# Try multinomial logistic regression
MLR <- multinom(Segments ~ ., data = mydata, maxit = 500)
da.MLR <- dominanceAnalysis(MLR)
# Error
I've discovered a partial answer.
The dominanceanalysis package can only be used on these models: Ordinary Least Squares, Generalized Linear Models, Dynamic Linear Models, and Hierarchical Linear Models.
Source: https://github.com/clbustos/dominanceAnalysis
This explains why it didn't work for my data: I wasn't using any of those models.
I have decided to pursue the decision-tree option for variable selection by using a random forest, as sketched below.
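A minimal sketch of that route (my own addition, not from the original post), assuming the mydata example above and treating Segments as a categorical outcome:
library(randomForest)
set.seed(1)
rf <- randomForest(factor(Segments) ~ ., data = mydata, importance = TRUE)
importance(rf)  # permutation and Gini importance for each variable
varImpPlot(rf)  # quick visual ranking of the variables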

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets are generated by setting the m argument to 20.
#-------------------assign missing values to data-----------------#
result <- ampute(pima)
result <- result$amp
#-------------------multiple imputation by chained equations-------#
# generate 20 imputed datasets
newresult <- mice(result, m = 20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions is skipped for the sake of the example. The test variable is set as the binary dependent variable.
# run a logistic regression on each of the 20 imputed datasets
model <- with(newresult,
              glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
                  family = binomial(link = "logit")))
Combine the regression estimates from the 20 imputed-data models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function allows generating predicted values with covariates fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y = 1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation is set to 3 for this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice the object class is mipo, not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
  Unrecognized variable name in 'at': (1) pregnant
I thought of two solutions:
a) changing the object's class to make it fit prediction()'s requirements
b) extracting the pooled regression parameters and reconstructing them in a list that fits prediction()'s requirements
However, I'm not sure how to achieve either, and I would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested to know that the pima dataset is a bit problematic (the Native Americans from whom the data were collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
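If I'm reading the marginaleffects workflow correctly (I haven't verified the printed output), the pooled predictions can then be summarized with the usual mice machinery:
summary(mod_imputation)  # Rubin's-rules combination of the 20 sets of predictions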

One-way ANOVA using weighted survey data in R

I have some survey data with sample weights, and I'm using the survey package in R to compare means between demographic groups. I've had no problems using svyttest for two-sample t-tests involving dichotomous independent variables (e.g., sex).
However, I'm having some issues with the anova.svyglm function in the survey package. When the generalized linear model has two independent variables, it works fine, but when the GLM has only one polytomous independent variable, I get a NULL result.
Here's some example code comparing BMI across demographic groups to illustrate my problem:
library(survey)
# Survey design
healthdsgn = svydesign(data = healthsvy, id = ~SVYPSU, strata = ~SVYSTRATA, weights = ~SVYWT, nest = TRUE)
# This has two independent variables
# Outputs a normal ANOVA table
race_sex_glm = svyglm(BMI~RACE+SEX, healthdsgn)
anova(race_sex_glm)
# This has one independent polytomous variable
# Does not output a normal ANOVA table. Returns NULL
race_glm = svyglm(BMI~RACE, healthdsgn)
anova(race_glm)
This problem doesn't occur when I use the aov function in R — both of these produce normal ANOVA output:
summary(aov(BMI~RACE+SEX, healthsvy))
summary(aov(BMI~RACE, healthsvy))
However, aov doesn't account for survey weights so the output would be incorrect.
What should I do to resolve this?
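One possible workaround (my suggestion, not something from the original post): test the polytomous term directly with survey::regTermTest, which jointly tests all levels of RACE while honoring the survey design.
# joint Wald test of all RACE coefficients, using the design-based vcov
regTermTest(race_glm, ~RACE)
# or a design-adjusted working likelihood ratio test
regTermTest(race_glm, ~RACE, method = "LRT")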

Wald chi-squared test between two variables in one mixed-effects model in R

I'm trying to finish a homework assignment in longitudinal data analysis.
The question asks me to compare the cross-sectional and longitudinal effects of age (baseline cross-sectional age: baseage; longitudinal age change: agechange) within a single model.
The model I fit looks like:
fit <- lme(logfev1 ~ baseage + agechange + height, random = ~ 1 | id,
           correlation = corAR1(form = ~ visit | id), data = logfev1)
In Stata we just need to type test baseage = agechange, and the output shows:
test baseage = agechange
[logfev1]baseage - [logfev1]agechange = 0
chi2(1)     = 0.41
Prob > chi2 = 0.5244
but in R I really don't know how to do this (Wald) test.
If you use glmer instead of lme, summary(fit) will actually give you Wald tests of the individual coefficients as part of its output.
Or you can call Anova(fit) from the car package on your lme fit and it will return Wald chi-square results.
You can type ?glmer in the R console to read about it, and if you install and load the car package you can run ?Anova (capitalized) to get the lowdown on that method.
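Neither of those gives the equality test between the two coefficients directly, though. A sketch of what I would try (my own addition, assuming the fit object above): car::linearHypothesis also has a method for lme fits and performs exactly this kind of Wald chi-squared test.
library(car)
# Wald chi-squared test of H0: baseage = agechange,
# analogous to Stata's "test baseage = agechange"
linearHypothesis(fit, "baseage = agechange")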

Logistic Regression Using R

I am running logistic regressions using R right now, but I cannot seem to get many useful model fit statistics. I am looking for metrics similar to SAS:
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm
Does anyone know how (or what packages) I can use to extract these stats?
Thanks
Here's a Poisson regression example:
## from ?glm:
d.AD <- data.frame(counts = c(18, 17, 15, 20, 10, 20, 25, 13, 12),
                   outcome = gl(3, 1, 9),
                   treatment = gl(3, 3))
glm.D93 <- glm(counts ~ outcome + treatment, data = d.AD, family = poisson())
Now define a function to fit an intercept-only model with the same response, family, etc., compute summary statistics, and combine them into a table (matrix). The formula .~1 in the update command below means "refit the model with the same response variable [denoted by the dot on the LHS of the tilde] but with only an intercept term [denoted by the 1 on the RHS of the tilde]"
glmsumfun <- function(model) {
  glm0 <- update(model, . ~ 1)  ## refit with intercept only
  ## apply the built-in logLik (log-likelihood), AIC, and
  ## BIC (Bayesian/Schwarz information criterion) functions
  ## to the models with and without covariates ('model' and 'glm0');
  ## combine the results in a two-column matrix with appropriate
  ## row and column names
  matrix(c(logLik(model), BIC(model), AIC(model),
           logLik(glm0), BIC(glm0), AIC(glm0)), ncol = 2,
         dimnames = list(c("logLik", "SC", "AIC"),
                         c("full", "intercept_only")))
}
Now apply the function:
glmsumfun(glm.D93)
The results:
                 full intercept_only
logLik      -23.38066      -26.10681
SC           57.74744       54.41085
AIC          56.76132       54.21362
EDIT:
anova(glm.D93,test="Chisq") gives a sequential analysis of deviance table containing df, deviance (=-2 log likelihood), residual df, residual deviance, and the likelihood ratio test (chi-squared test) p-value.
drop1(glm.D93) gives a table with the AIC values (df, deviances, etc.) for each single-term deletion; drop1(glm.D93,test="Chisq") additionally gives the LRT test p value.
Certainly glm with a family="binomial" argument is the function most commonly used for logistic regression. Note that the default handling of factor contrasts differs between the two systems: R uses treatment contrasts and SAS (I think) uses sum contrasts. You can look these technical issues up on R-help; they have been discussed many, many times over the last ten-plus years.
I see Greg Snow mentioned lrm in 'rms'. It has the advantage of being supported by several other functions in the 'rms' suite of methods. I would use it, too, but learning the rms package may take some additional time. I didn't see an option that would create SAS-like output.
If you want to compare the packages on similar problems, the UCLA StatComputing pages have another resource: http://www.ats.ucla.edu/stat/r/dae/default.htm , where a large number of methods are exemplified in SPSS, SAS, Stata, and R.
Using the lrm function in the rms package may give you the output that you are looking for.
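For example, here is a minimal sketch with made-up data (the question doesn't include any): printing an lrm fit shows the model likelihood ratio chi-square, pseudo-R2, and the C statistic, which is much closer to the SAS display.
library(rms)
set.seed(1)
# hypothetical binary-outcome data, purely for illustration
dd <- data.frame(y = rbinom(100, 1, 0.5), x1 = rnorm(100), x2 = rnorm(100))
fit <- lrm(y ~ x1 + x2, data = dd)
fit  # the print method shows the LR test and discrimination indexes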
