R package multgee -- initial values - r

I'm working on estimating a generalized estimating equation in R. I have a multinomial (ordinal) outcome, and so I have been attempting to use the package multgee since, as far as I know, packages like geepack or gee don't allow for the estimation of multinomial outcomes.
However, I'm running into some issues. The documentation seems good, but in particular, it seems to be requiring initial (starting) values in the model. If I try to run the model without it, I get a line requesting starting values. Here's the model:
formula <- PAINAD_recode~Age + Gender + group_2part + Cornell + stim_intensity_scale
fitmod <- ordLORgee(formula,data=data,bstart=c(1,0,1,0,1),id=data$subject,
repeated=data$trial)
I just threw in some ones and zeroes for the starting values there. However, when I enter starting values (even plausible ones), it claims that:
Starting values and parameters vector differ in length
I thought that with five predictors, I would need five starting values. I can't find more information about this particular matrix. Does anyone have any thoughts on this? The outcome here has five levels (ordinal) and the repeated component has 20 levels. ANy suggestions would be appreciated.

Related

Negative binomial model with multiply imputed, weighted dataset in R

I am running an analysis of hospital length of stay based on a number of parameters in R, with one primary exposure. Many of the covariates are missing, most commonly lab values, because these aren't checked for all patients. I've worked out what seems to be a good multiple imputation schema using MICE. Because of imbalance between exposed and unexposed groups, I'm also weighting using propensity scores.
I've managed to run a successful weighted Poisson model with MICE and WeightThem. However, when I checked the models for overdispersion, it does appear that the variance is greater than the mean, implying I should be using a quasipoisson or negative binomial model. However, I can't find documentation on negative binomial models with WeightThem or WeightIt in R.
Does anyone have any experience? To run a negative binomial model, i can just use the following code:
results <- with(models, MASS::glm.nb(LOS ~ exposure + covariate1 + covariate2)
in which "models" is the multiply-imputed WeightIt object.
However, according to the WeightIt documentation, when using any glm model you need to run it as a svyglm to get proper standard errors:
results <- with(models, svyglm(LOS ~ exposure + covariate1 + covariate2,
family = poisson()))
There is a function in the sjstats package called svyglm.nb, but this requires creating a design matrix or the model won't run. I have no idea how/whether this is necessary - is the first version (just glm.nb) sufficient? Am I entirely thinking about this wrong?
Thanks so much, advice is much appreciated.

Regression model with missing data in dependant variable

modelo <- lm( P3J_IOP~ PräOP_IOP +OPTyp + P3J_Med, data = na.omit(df))
summary(modelo)
Error:
Fehler in step(modelo, direction = "backward") :
Number of lines used has changed: remove missing values?
I have a lot of missing values in my dependent variable P3J_IOP.
Has anyone any idea how to create the model?
tl;dr unfortunately, this is going to be hard.
It is fairly difficult to make linear regression work smoothly with missing values in the predictors/dependent variables (this is true of most statistical modeling approaches, with the exception of random forests). In case it's not clear, the problem with stepwise approaches with missing data in the predictor is:
incomplete cases (i.e., observations with missing data for any of the current set of predictors) must be dropped in order to fit a linear model;
models with different predictor sets will typically have different sets of incomplete cases, leading to the models being fitted on different subsets of the data;
models fitted to different data sets aren't easily comparable.
You basically have the following choices:
drop any predictors with large numbers of missing values, then drop all cases that have missing values in any of the remaining predictors;
use some form of imputation, e.g. with the mice package, to fill in your missing data (in order to do proper statistical inference, you need to do multiple imputation, which may be hard to combine with stepwise regression).
There are some advanced statistical techniques that will allow you to simultaneously do the imputation and the modeling, such as the brms package (here is some documentation on imputation with brms), but it's a pretty big hammer/jump in statistical sophistication if all you want to do is fit a linear model to your data ...

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset and the confusion matrices are very good for what I need and combining seven years' of data have also been done. Straight-forward.
However, I now need to apply the model to the current years' data, which of course does not have the NOTCONTINUE column in it. Lets say the glm model is "CombinedYears" and the new data is "Data2020"
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file ?? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for this. The output of your model when you apply it on new data, or even the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold of these probabilities which you can use later on when you predict new data. For example, you may determine 0.5 as a cutoff, and every probability above that is considered NOTCONTINUE and below that is CONTINUE. However, the best threshold can be determined from your data as well by maximizing both specificity and sensitivity. This can be done by calculating the area under the receiver operating characteristic curve (AUC). There are many packages than can do this for you, such as pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc.roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each
student. Use the cutoff you obtained from step (1) to classify your
students accordingly

R: glm (multiple linear regression) ignores/removes some predictor variables

I have posted this question before, but I believe that I had not explained the problem well and that it was over-complicated, so I deleted my previous post and I am posting this one instead. I am sorry if this caused any inconvenience.
I also apologize in advance for not being able to provide example data, I am using very large tables, and what I am trying to do works fine with simpler examples, so providing example data cannot help. It has always worked for me until now. So I am just trying to get your ideas on what might be the issue. But if there is any way I could provide more information, do let me know.
So, I have a vector corresponding to a response variable and a table of predictor variables. The response vector is numeric, the predictor variables (columns of the table) are in the binary format (0s and 1s).
I am running the glm function (multivariate linear regression) using the response vector and the table of predictors:
fit <- glm(response ~ as.matrix(predictors), na.action=na.exclude)
coeff <- as.vector(coef(summary(fit))[,4])[-1]
When I have been doing that in the past, I would extract the vector of regression coefficient to use it for further analysis.
The problem is that now the regression returns a vector of coefficients which is missing some values. Essentially some predictor variables are not attributed a coefficient at all by glm. But there are no error messages.
The summary of the model looks normal, but some predictor variables are missing like I mentioned. Most other predictors have assigned data (coefficient, pvalue, etc.).
About 30 predictors are missing from the model, over 200.
I have tried using different response variables (vectors), but I am getting the same issue, although the missing predictors vary depending on the response vector...
Any ideas on what might be going on? I think this can happen if some variables have 0 variance, but I have checked that. There are also no NA values and no missing values in the tables.
What could cause glm to ignore/remove some predictor variables?
Any suggestion is welcome!
EDIT: I found out that the predictors that were removed has values identical to another predictor. There should still be a way to keep them, and they would get the same regression coefficient for example
Your edit explains why you are not getting those variables. That was going to be my first question. (This question would be better posed on Cross validated because it is not an R error, it is a problem with your model.)
They would not get the same coefficients: Say you have a 1:1 relationship, Y = X + e, Then fit simple model Y ~ X + X. Each X is going to be assigned ANY value such that the sum is equal to 1. There is no solution. Y = 0.5X + 0.5X may be the most obvious to us, but Y = 100X -99X is just as valid.
You also cannot have any predictors that are linear sums of other predictors for the same reason.
If you really want those values you can generate them from what you have. However I do not recommend it because the assumptions are going to be on very thin ice.

Pseudo R squared for cumulative link function

I have an ordinal dependent variable and trying to use a number of independent variables to predict it. I use R. The function I use is clm in the ordinal package, to perform a cumulative link function with a probit link, to be precise:
I tried the function pR2 in the package pscl to get the pseudo R squared with no success.
How do I get pseudo R squareds with the clm function?
Thanks so much for your help.
There are a variety of pseudo-R^2. I don't like to use any of them because I do not see the results as having a meaning in the real world. They do not estimate effect sizes of any sort and they are not particularly good for statistical inference. Furthermore in situations like this with multiple observations per entity, I think it is debatable which value for "n" (the number of subjects) or the degrees of freedom is appropriate. Some people use McFadden's R^2 which would be relatively easy to calculate, since clm generated a list with one of its values named "logLik". You just need to know that the logLikelihood is only a multiplicative constant (-2) away from the deviance. If one had the model in the first example:
library(ordinal)
data(wine)
fm1 <- clm(rating ~ temp * contact, data = wine)
fm0 <- clm(rating ~ 1, data = wine)
( McF.pR2 <- 1 - fm1$logLik/fm0$logLik )
[1] 0.1668244
I had seen this question on CrossValidated and was hoping to see the more statistically sophisticated participants over there take this one on, but they saw it as a programming question and dumped it over here. Perhaps their opinion of R^2 as a worthwhile measure is as low as mine?
Recommend to use function nagelkerke from rcompanion package to get Pseudo r-squared.
When your predictor or outcome variables are categorical or ordinal, the R-Squared will typically be lower than with truly numeric data. R-squared merely a very weak indicator about model's fit, and you can't choose model based on this.

Resources