Logistic regression for numerical predictor? - r

I'm working on a data set and want to use some of the following variables to predict "Operatieduur" (operation duration). All the predictors have been factorized.
LogicFit <- train(Operatieduur ~ Anesthesioloog + Aorta_chirurgie + Benadering +
                    Chirurg + Operatietype,
                  data = TrainData, method = "glm", family = "binomial")
Here I use "train" function from caret package to make a logistic fitting with glm. When I ran this code I got the error message:
1: model fit failed for Resample01: parameter=none Error in eval(family$initialize) : y values must be 0 <= y <= 1
I googled it and found that the reason is that the response "Operatieduur" is a continuous numerical value (it's a duration). So how should I modify the call to use the predictors (they are all categorical) to predict a continuous numerical value? Can logistic regression do that?

Logistic regression predicts categories, not numerical variables. If you want to predict a continuous numerical variable (even from categorical predictors), use ordinary linear regression. Depending on the number of categories of your predictor variables, you may also want to consider one-hot/dummy encoding.
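For example, a minimal sketch of the linear-regression version, keeping the caret workflow and assuming the TrainData and variable names from the question:

library(caret)

# Same formula as before, but method = "lm" fits an ordinary linear model,
# which is appropriate for a continuous response such as a duration.
LinFit <- train(Operatieduur ~ Anesthesioloog + Aorta_chirurgie + Benadering +
                  Chirurg + Operatietype,
                data = TrainData, method = "lm")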

Related

How to implement $\delta$ shift parameter in categorical variable for multiple imputation sensitivity analysis?

Suppose I am given data as follows.
# Simulated example: 10 subjects under treatment, 10 under control
set.seed(1005)  # seed before any random draws so the example is reproducible
id1 <- rep(1:10, 2)
trt <- c(rep(1, 10), rep(0, 10))
outcome <- rnorm(20)

# Set trt to missing with probability increasing in |outcome|
missing <- c()
for (i in outcome) {
  if (rbinom(1, 1, 0.8 * abs(i / max(abs(outcome)))) == 1) {
    missing <- c(missing, which(outcome == i))
  }
}
missing
trt[missing] <- NA

dat <- data.frame(id = id1, trt = as.factor(trt), outcome = outcome)
a1 <- mice::mice(dat, method = c('', 'logreg', ''))
I can use mice to impute the data first and then conduct the analysis on a1, where I assume that id and outcome predict trt by logistic regression. (In fact, only outcome predicts trt here.) a1$formulas$trt accesses the formula used for imputation. I want to modify the formula so that there is a constant offset.
forms_a1 <- a1$formulas
forms_a1$trt <- as.formula(trt ~ outcome + offset(2))
mice::mice(dat, method = c('', 'logreg', ''), formulas = forms_a1)
However, mice gives the following output.
 iter imp variable
  1   1  trt
Error in model.frame.default(formula, data = data, na.action = na.pass) :
  variable lengths differ (found for 'offset(2)')
$Q1:$ How do I offset the intercept here? I think I could add an extra column as an offset variable and modify the formula and predictorMatrix. The $\delta$ shift is implemented for the continuous-variable case here: https://stefvanbuuren.name/fimd/sec-sensitivity.html. However, it seems that this approach may change the estimated coefficients.
$Q2:$ If I am interested in offsetting a slope, say by 10*id in the above formula, how would I do so?
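A hedged sketch of the extra-column idea from Q1, assuming mice's logreg method honours offset() terms once the lengths match (worth verifying); the column name delta is mine, not from mice:

# Hypothetical workaround: offset(2) fails because 2 has length 1, so store
# the constant shift in a column of dat and reference that column instead.
dat$delta <- 2

forms_a1 <- a1$formulas
forms_a1$trt <- as.formula(trt ~ outcome + offset(delta))

# dat now has four columns, so method needs four entries
mice::mice(dat, method = c('', 'logreg', '', ''), formulas = forms_a1)

# Q2: a slope shift could in principle use the same mechanism, e.g.
# forms_a1$trt <- as.formula(trt ~ outcome + offset(10 * id))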

Error when using a multilevel regression (lme4)

I want to use a multilevel regression to analyse the effect of some independent variables on a dependent variable, with varying intercepts and slopes.
My regression includes non-numeric independent variables, which I want to use for the varying intercept and slope. The dependent variable is numeric. When fitting this multilevel regression I get the following error:
Error: number of observations (=88594) <= number of random effects (=337477) for term (1 + x | z); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
x and z are characters and are correlated with each other and with y. This is the regression I use:
multi_reg1 <- lmer(y ~ 1 + x + (1 + x | z), REML = FALSE, data = data_frame1)
Is there a way to fix this problem, or is it not possible and I have to use other regression methods?
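As a hedged illustration of what the error points at (not necessarily the right model for this analysis): with a categorical x, the term (1 + x | z) estimates a random intercept plus one random slope per non-reference level of x for every level of z, which here outnumbers the observations. Dropping the varying slope is the simplest way to make the random part identifiable:

library(lme4)

# Sketch only: keep the fixed effect of x but give each level of z
# just a random intercept, assuming the data_frame1 from the question.
multi_reg2 <- lmer(y ~ 1 + x + (1 | z), REML = FALSE, data = data_frame1)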

Having trouble with overfitting in simple R logistic regression

I am a newbie to R and I am trying to perform a logistic regression on a set of clinical data.
My independent variables are AGE, TEMP, WBC, NLR, CRP, PCT, ESR, IL6, and TIME.
My dependent variable is the binary CRKP.
After fitting the model below with glm, I got this warning message:
glm.fit <- glm(CRKP ~ AGE + TEMP + WBC + NLR + CRP + PCT + ESR,
               data = cv, family = binomial, subset = train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
I searched for potential causes and used the corrplot function to check whether there is multicollinearity that could potentially result in overfitting.
This is what I have as the plot.
The correlation plot shows that my ESR and PCT variables are highly correlated; similarly, CRP and IL6 are highly correlated. But they are all important clinical indicators I would like to include in the model.
I tried to use the VIF to selectively discard variables, but wouldn't that be biased? Also, I would have to sacrifice some of my variables of interest.
Does anyone know what I can do? Please help. Thank you!
You have a multicollinearity problem but don't want to drop variables. In this case you can use Partial Least Squares (PLS) or Principal Component Regression (PCR).
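As a rough, untested sketch of the PLS route, staying within caret (its "pls" method runs PLS-DA when the outcome is a factor); the cv data frame and variable names come from the question, while the factor labels and tuning choices are assumptions:

library(caret)

# classProbs requires valid factor level names; "neg"/"pos" are placeholders
cv$CRKP <- factor(cv$CRKP, labels = c("neg", "pos"))

# PLS works on centred and scaled inputs; tuneLength tries 1..5 components
plsFit <- train(CRKP ~ AGE + TEMP + WBC + NLR + CRP + PCT + ESR + IL6,
                data = cv,
                method = "pls",
                preProcess = c("center", "scale"),
                tuneLength = 5,
                trControl = trainControl(method = "cv", classProbs = TRUE))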

How to fit ordered logistic regression using svyglm()?

I am trying to fit an ordered logistic regression for weighted data using svyglm() from the survey package:
model <- svyglm(freehms ~ agea, design = wave9_design,
                family = binomial(link = "logit"))
freehms is numeric, ranging from 1 to 5 (I've tried setting it as a factor), and agea is numeric too. I have many more variables, but didn't include them here for simplicity.
But for some reason I get the following error message:
"Error in eval(family$initialize) : y values must be 0 <= y <= 1"
I have looked at online examples and tutorials, and I just can't find what I'm doing wrong. I don't understand why R insists my dependent variable be binary when I have specified the link function (logit) to address this very problem.
You want the svyolr() function in the survey package. Or the new svyVGAM package, which fits a wide range of ordinal models. svyglm() doesn't fit this model because it isn't a generalised linear model.
For example:
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)
dclus2 <- update(dclus2, mealcat = as.ordered(cut(meals, c(0, 25, 50, 75, 100))))

# Proportional-odds (ordered logit) model with survey weights
svyolr(mealcat ~ avg.ed + mobility + stype, design = dclus2)

# The same model via svyVGAM
library(svyVGAM)
svy_vglm(mealcat ~ avg.ed + mobility + stype, design = dclus2, family = propodds())

Calculating VIF for ordinal logistic regression & multicollinearity in R

I am running an ordinal regression model. I have 8 explanatory variables: 4 of them categorical ('0' or '1'), 4 of them continuous. Beforehand, I want to be sure there's no multicollinearity, so I use the variance inflation factor (the vif function from the car package):
mod1 <- polr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, Hess = TRUE, data = df)
vif(mod1)
but I get a VIF value of 125 for one of the variables, as well as the following warning:
Warning message: In vif.default(mod1) : No intercept: vifs may not be sensible.
However, when I convert my dependent variable to numeric (instead of a factor) and do the same thing with a linear model:
mod2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df)
vif(mod2)
This time all the VIF values are below 3, suggesting that there's no multicollinearity.
I am confused about the vif function. How can it return VIFs above 100 for one model and low VIFs for another? Should I stick with the second result and still fit an ordinal model anyway?
The vif() function uses determinants of the correlation matrix of the parameters (and subsets thereof) to calculate the VIF. In a linear model, this includes just the regression coefficients, excluding the intercept.

The vif() function wasn't intended to be used with ordered logit models. When it finds the variance-covariance matrix of the parameters, it includes the threshold parameters (i.e., the intercepts), which would normally be excluded by the function in a linear model. This is why you get that warning: it doesn't know to look for the threshold parameters and remove them.

Since the VIF is really a function of the inter-correlations of the design matrix (which depends neither on the dependent variable nor on the non-linear mapping from the linear predictor into the space of the response, i.e. the link function in a GLM), you should get the right answer with your second approach above: lm() with a numeric version of your dependent variable.
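A quick way to check the claim that the VIFs depend only on the predictors, sketched under the assumption that X1 through X8 are numeric columns of df (0/1 dummies included):

# The VIFs are the diagonal of the inverse of the predictors'
# correlation matrix; no Y and no link function is involved.
X <- df[, paste0("X", 1:8)]
vifs <- diag(solve(cor(X)))
vifs  # should match vif(mod2)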
