How to perform logistic regression on not binary variable? - r

I was searching for this answer and I'm really suprised that haven't found it. I just want to peform three level logistic regression in R.
Let's define some artificial data:
set.seed(42)
y <- sample(0:2, 100, replace = T)
x <- rnorm(100)
My variable y is containing three numbers - 0, 1 and 2. So I thought that the simplest way would be just to use:
glm(y ~ x, family = binomial("logit"))
However I got information that y should be in interval [0,1]. Do you know how I can perform this regression ?
Please notice - I know that it's not so straightforward to perform multilevel logistic regression, there are several techniques how to do so e.g. One vs all. But as I was seeking for it, I wasn't able to find any.

Logistic regression as implemented by glm only works for 2 levels of output, not 3.
The message is a little vauge because you can specify the y-variable in logistic regression as 0s and 1s, or as a proportion (between 0 and 1) with a weights argument specifying the number of subjects the proportion is of.
With 3 or more ordered levels in the response you need to use a generalization, one common generalization is proportional odds logistic regression (also goes by other names). The polr function in the MASS package and the lrm function in the rms package (and probably other functions in other packages) fit these types of models, but glm does not.

set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
multinomial regression
If you don't want to treat your responses as ordered (i.e., nominal or categorical values):
library(nnet) ## 'recommended' package, i.e. installed by default
multinom(y~x)
Results
# weights: 9 (4 variable)
initial value 109.861229
final value 104.977336
converged
Call:
multinom(formula = y ~ x)
Coefficients:
(Intercept) x
1 -0.001529465 0.29386524
2 -0.649236723 -0.01933747
Residual Deviance: 209.9547
AIC: 217.9547
Or, if your responses are ordered:
ordinal regression
MASS::polr() does proportional-odds logistic regression. (You may also be interested in the ordinal package, which has more features; it can also do multinomial models.)
library(MASS) ## also 'recommended'
polr(ordered(y)~x)
Results
Call:
polr(formula = ordered(y) ~ x)
Coefficients:
x
0.06411137
Intercepts:
0|1 1|2
-0.4102819 1.3218487
Residual Deviance: 212.165
AIC: 218.165

If you read the error message, it offers a hint that you might get success with:
y <- sample(seq(0,1,length=3), 100, replace = T)
And in fact, you do. Now you challenge might be to interpret that in the context of the actual situation in reality (which you have not offered.) You do get a warning, but R warnings are not errors.
You might also look up the topic of polychotomous logistic regression, which is implemented in several variants that might be useful in particular situations. Frank Harrell's book Regression Modeling Strategies has material on such techniques. You may also post further questions on CrossValidated.com if you need help choosing which route to go.

Related

Longitudinal analysis using sampling weigths in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a parameter weights, but I'm not sure if it treats them as sampling weights. Still, this function can't support multinomial family.
So, to resume: can you enlighten me on how to do a pre-post test analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
id=rep(1:5,2),
time=c(rep("Pre",5),rep("Post",5)),
outcome1=sample(c("Yes","No"),10,replace=T),
outcome2=sample(c("Low","Medium","High"),10,replace=T),
outcome3=rnorm(10),
group=rep(sample(c("Man","Woman"),5,replace=T),2),
weight=rep(c(1,0.5,1.5,0.75,1.25),2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with variables used to calculate the weights as predictors. It happens that glmer returns a lot of problems (convergence, high eigenvalues...), so I give another look at #ThomasLumley answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question is now if a can use participants id as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.875e+01 1.000e+00 18.746 0.0339 *
groupWoman -1.903e+01 1.536e+00 -12.394 0.0513 .
timePre 5.443e-09 5.443e-09 1.000 0.5000
groupWoman:timePre 2.877e-01 1.143e+00 0.252 0.8431
and still interpret groupWoman:timePre as differences in the average rate of change/improvement in the outcome over time between sex groups, as if I was using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svy_glm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.

Why do heteroscedasticity-robust standard errors in logistic regression?

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
formula = y ~ x1 + x2,
data = df,
family = quasibinomial(link = "logit"),
weights = weight
)
This makes perfectly sense to me. However, then we are being recommended to use the following to get coefficients and heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me bit. Reading about vcovHC is states that it creates a "heteroskedasticity-consistent estimation". Why would you do this when doing logistic regression? I taught it did not assume homoscedasticity? Also, I am not sure what the coeftest does?
Thank you!
You're right - homoscedasticity (residuals at each level of the predictor have the same variance), is not an assumption in logistic regression. However, the binary response in logistic regression is heteroscedastic (0 or 1) which is why a corresponding estimator should be consistent with it. I guess that is what is meant with "heteroscedasticity-consistent". As #MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely to be the place to be. The coeftest produces the Wald test statistic of the estimated coefficients. These tests give you some information on whether a predictor (independent variable) seems to be associated to the dependent variable according to your data.

mgcv: Error Model has more coefficients than data, related to the argument by in the gam()

In case a, the gam code in mgcv R package is working well.
library(mgcv)
dat <- gamSim(1,n=400,dist="normal",scale=2)
num_knots = nrow(dat)
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2),data=dat)
summary(fit)
But after I added the argument by in the gam(), it reported the error "Model has more coefficients than data".
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2, by = x1),data=dat)
The error confuses me because I thought adding the by argument to create the interaction between the smoothing term and the parametric term should not increase the number of unknown coefficients, though it turns out that I am wrong. Where was I wrong?
When you pass a continuous variable to by, what you are getting is varying coefficient model where the effect of x1 varies as a smooth function of x0.
What is happening in the first case is that because of identifiability constraints being applied to the basis expansion for x0, you requested num_knots basis functions but actually got num_knots - 1 basis functions. When you add the intercept you get num_knots coefficients which is OK to fit with this model as it is a penalised spline (though you probably want method = 'REML'). The identifiability constraint is applied because there is a basis function (or combination) that is confounded with the model intercept and you can't fit two constant terms in the model and have them be uniquely identified.
In the second case, the varying coefficient model, there isn't the same issue, so when you ask for num_knots basis functions plus an intercept you are trying to fit a model with 401 coefficients with 400 observations which isn't going to work.

OLS estimation with AR(1) term

For reasons that I cannot explain (because I can't, not because I don't want to), a process used at my office requires running some regressions on Eviews.
The equation specification used on Eviews is:
dependent_variable c independent_variable ar(1)
Furthermore, the process used is "NLS and ARMA."
I don't use Eviews but, as I understand it, that equation means an OLS regression with a constant, one independent variable and an AR(1) term.
I tried running this in R:
result <- lm(df$dependent[2:48] ~ df$independent[1:47] + df$dependent[1:47])
Where df is a data.frame containing the dependent and independent variables (both spanning 48 observations).
Am I doing it right? Because the parameter estimations, while similar, are different in Eviews. Different enough that I cannot use them.
I've thoroughly searched the internet for what this means. I've read up on ARIMA and ARMAX models but I don't think that this is it. I'm sorry but I'm not that knowledgeable on statistics. By the way, estimating ARMAX models seems very complicated and is done by ML, not LS, so I'm really hoping that's not it.
EDIT: I had to edit the model indexes again because I messed them up, again.
You need arima function, see ?arima
Example with some data
y <- lh # lh is Luteinizing Hormone in Blood Samples in datasets package (Base)
set.seed(001)
x <- rnorm(length(y), 100, 10)
arima(y, order = c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)
Coefficients:
ar1 intercept x
0.5810 1.8821 0.0053
s.e. 0.1153 0.6991 0.0068
sigma^2 estimated as 0.195: log likelihood = -29.08, aic = 66.16
See ?arima to find help about its arguments.

Logistic Regression Using R

I am running logistic regressions using R right now, but I cannot seem to get many useful model fit statistics. I am looking for metrics similar to SAS:
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm
Does anyone know how (or what packages) I can use to extract these stats?
Thanks
Here's a Poisson regression example:
## from ?glm:
d.AD <- data.frame(counts=c(18,17,15,20,10,20,25,13,12),
outcome=gl(3,1,9),
treatment=gl(3,3))
glm.D93 <- glm(counts ~ outcome + treatment,data = d.AD, family=poisson())
Now define a function to fit an intercept-only model with the same response, family, etc., compute summary statistics, and combine them into a table (matrix). The formula .~1 in the update command below means "refit the model with the same response variable [denoted by the dot on the LHS of the tilde] but with only an intercept term [denoted by the 1 on the RHS of the tilde]"
glmsumfun <- function(model) {
glm0 <- update(model,.~1) ## refit with intercept only
## apply built-in logLik (log-likelihood), AIC,
## BIC (Bayesian/Schwarz Information Criterion) functions
## to models with and without intercept ('model' and 'glm0');
## combine the results in a two-column matrix with appropriate
## row and column names
matrix(c(logLik(glm.D93),BIC(glm.D93),AIC(glm.D93),
logLik(glm0),BIC(glm0),AIC(glm0)),ncol=2,
dimnames=list(c("logLik","SC","AIC"),c("full","intercept_only")))
}
Now apply the function:
glmsumfun(glm.D93)
The results:
full intercept_only
logLik -23.38066 -26.10681
SC 57.74744 54.41085
AIC 56.76132 54.21362
EDIT:
anova(glm.D93,test="Chisq") gives a sequential analysis of deviance table containing df, deviance (=-2 log likelihood), residual df, residual deviance, and the likelihood ratio test (chi-squared test) p-value.
drop1(glm.D93) gives a table with the AIC values (df, deviances, etc.) for each single-term deletion; drop1(glm.D93,test="Chisq") additionally gives the LRT test p value.
Certainly glm with a family="binomial" argument is the function most commonly used for logistic regression. The default handling of contrasts of factors is different. R uses treatment contrasts and SAS (I think) uses sum contrasts. You can look these technical issues up on R-help. They have been discussed many, many times over the last ten+ years.
I see Greg Snow mentioned lrm in 'rms'. It has the advantage of being supported by several other functions in the 'rms' suite of methods. I would use it , too, but learning the rms package may take some additional time. I didn't see an option that would create SAS-like output.
If you want to compare the packages on similar problems that UCLA StatComputing pages have another resource: http://www.ats.ucla.edu/stat/r/dae/default.htm , where a large number of methods are exemplified in SPSS, SAS, Stata and R.
Using the lrm function in the rms package may give you the output that you are looking for.

Resources