I am trying to determine if the hospital I work at should open a new unit for admissions. I intend to do this by correlating patient assigned unit and length of stay in days.
So far, I have determined that the continuous dependent variable (length of stay in days) should be log transformed, as well as the continuous age variable. Unit is categorical and contains 7 categories (unit 65, unit 66, unit 67, unit 75, unit 76, unit 77, unit 94). Gender is standard 0=male and 1=female.
My regression equation is as follows:
log_glm <- glm(log(days)~ gender + log(age) + unit, family=gaussian, data=data2022)
Model Summary:
glm(formula = log(days) ~ gender + log(age) + unit, family = gaussian,
data = data2022)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.3606 -0.6712 0.0011 0.7146 2.8338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7416 0.9994 0.742 0.459064
gender 0.3257 0.2082 1.564 0.119566
log(age) 1.3809 0.2353 5.870 2.17e-08 ***
unitunit_66 0.1329 0.2994 0.444 0.657717
unitunit_67 1.2518 0.3334 3.755 0.000237 ***
unitunit_75 -1.8108 0.3067 -5.903 1.84e-08 ***
unitunit_76 0.2797 0.3120 0.896 0.371355
unitunit_77 -0.3540 0.3178 -1.114 0.266803
unitunit_94 0.1876 0.5350 0.351 0.726339
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.145772)
Null deviance: 429.95 on 181 degrees of freedom
Residual deviance: 198.22 on 173 degrees of freedom
(14 observations deleted due to missingness)
AIC: 552.03
Number of Fisher Scoring iterations: 2
All units at this hospital house patients with different needs. Is it appropriate to "compare" each unit in the summary output to the reference unit when all units are unique? Is this interpretation correct:
Unit 67 is statistically significant at the 0.05 level. The coefficient is 1.2518, so being assigned to unit 67 is associated with a length of stay in days that is 249.6631% ((exp(1.2518)-1)*100) higher on average than for those assigned to unit 65.
Doubting myself because that result seemed too high and interpretation incorrect, I tried a model with dummy variables for unit:
log_glm_age <- glm(log(days)~ gender + log(age) + unit_65 + unit_66 + unit_67 + unit_75 + unit_76 + unit_77 + unit_94, family=gaussian, data=data2022)
With output:
glm(formula = log(days) ~ gender + log(age) + unit_65 + unit_66 +
unit_67 + unit_75 + unit_76 + unit_77 + unit_94, family = gaussian,
data = data2022)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.3606 -0.6712 0.0011 0.7146 2.8338
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.92919 1.10243 0.843 0.400475
gender 0.32568 0.20819 1.564 0.119566
log(age) 1.38094 0.23525 5.870 2.17e-08 ***
unit_65 -0.18755 0.53500 -0.351 0.726339
unit_66 -0.05465 0.52964 -0.103 0.917936
unit_67 1.06427 0.55764 1.909 0.057979 .
unit_75 -1.99832 0.53701 -3.721 0.000268 ***
unit_76 0.09213 0.54694 0.168 0.866435
unit_77 -0.54160 0.52780 -1.026 0.306252
unit_94 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.145772)
Null deviance: 429.95 on 181 degrees of freedom
Residual deviance: 198.22 on 173 degrees of freedom
(14 observations deleted due to missingness)
AIC: 552.03
Number of Fisher Scoring iterations: 2
But I am having doubts interpreting this. Since each patient is only admitted to one unit, is it appropriate to have all units in the model?
I don't know if it's because I am the only analyst in the entire hospital and can't discuss this with anyone who understands or if I'm having a weird day, but I can't figure out what is correct right now and could use some guidance.
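A minimal sketch of the arithmetic behind both summaries, using only the coefficients printed above (nothing is refitted, and the vector names are made up for illustration). It shows the percent-change calculation for a log-transformed outcome, and that the difference between two units is the same whichever unit is the reference level. In the second model R dropped unit_94, so unit 94 is effectively the reference there.

```r
# Unit coefficients copied from the first summary (reference level: unit 65)
b_ref65 <- c(unit_66 = 0.1329, unit_67 = 1.2518, unit_75 = -1.8108,
             unit_76 = 0.2797, unit_77 = -0.3540, unit_94 = 0.1876)

# Ratio of length of stay relative to unit 65, and the percent change
ratio   <- exp(b_ref65)
pct_chg <- (ratio - 1) * 100
print(round(pct_chg, 1))

# Two coefficients copied from the second summary (unit_94 was dropped,
# so these are contrasts against unit 94)
b_ref94 <- c(unit_65 = -0.18755, unit_67 = 1.06427)

# Unit 67 vs unit 65 from the second model: same contrast as the first model
diff_67_vs_65 <- unname(b_ref94["unit_67"] - b_ref94["unit_65"])
print(diff_67_vs_65)

# To choose a different reference level in the original model one could use,
# for example (assuming unit is stored with these level names):
#   data2022$unit <- relevel(factor(data2022$unit), ref = "unit_67")
```

Both fits are the same model written with a different baseline, which is why the residuals, deviance and AIC are identical. Seven dummies plus an intercept are perfectly collinear, so one of them always has to be left out.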
I've run an Interrupted Time Series Analysis using a GLM and need to be able to exponentiate outcomes in order to validate. I have been recommended the emmeans package, but I'm not quite sure how to do it.
Base R summary is below:
summary(fit1a)
Call:
glm(formula = `Subject Total` ~ Quarter + int2 + time_since_intervention2,
family = "poisson", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4769 -0.5111 0.1240 0.6103 0.9128
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.54584 0.09396 37.737 <0.0000000000000002 ***
Quarter -0.02348 0.01018 -2.306 0.0211 *
int2 -0.23652 0.21356 -1.108 0.2681
time_since_intervention2 -0.02624 0.04112 -0.638 0.5234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 63.602 on 23 degrees of freedom
Residual deviance: 13.368 on 20 degrees of freedom
AIC: 140.54
Number of Fisher Scoring iterations: 4
I can't really work out how to get started with emmeans. How would I write the code to get the estimates for Quarter, int2 and time_since_intervention2 on the response scale?
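One possible starting point, as a sketch rather than a definitive answer: with a log link, exponentiating a coefficient gives a rate ratio, so the back-transformation can be done directly from the printed estimates. The numbers below are copied from the summary above, and the Wald intervals are an assumption about what "validate" requires. The emmeans call in the comment assumes the package is installed and that int2 is coded 0/1.

```r
# Estimates and standard errors copied from summary(fit1a); log link
est <- c(Quarter = -0.02348, int2 = -0.23652,
         time_since_intervention2 = -0.02624)
se  <- c(Quarter = 0.01018, int2 = 0.21356,
         time_since_intervention2 = 0.04112)

# Rate ratios with Wald 95% intervals, back-transformed from the log scale
z <- qnorm(0.975)
rate_ratios <- cbind(ratio = exp(est),
                     lower = exp(est - z * se),
                     upper = exp(est + z * se))
print(round(rate_ratios, 3))

# With the fitted object at hand, exp(coef(fit1a)) gives the same ratios.
# emmeans back-transforms predictions rather than coefficients, e.g. the
# expected count before and after the intervention, with the other
# predictors held at their means:
#   emmeans::emmeans(fit1a, ~ int2, at = list(int2 = c(0, 1)),
#                    type = "response")
```

A ratio below 1 means a decrease. For example, the Quarter ratio is the multiplicative change in the expected count per additional quarter.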
Apologies for any bad English, it is not my first language :)
So I have a dataset of the passengers of the titanic, and produced the following fit summary:
glm(formula = Survived ~ factor(Pclass) + Age + I(Age^2) + Sex +
Fare + I(Fare^2), family = binomial(), data = titan)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7298 -0.6738 -0.3769 0.6291 2.4821
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.678e+00 6.321e-01 7.401 1.35e-13 ***
factor(Pclass)2 -1.543e+00 3.525e-01 -4.377 1.20e-05 ***
factor(Pclass)3 -2.909e+00 3.882e-01 -7.494 6.69e-14 ***
Age -6.813e-02 2.196e-02 -3.102 0.00192 **
I(Age^2) 4.620e-04 3.193e-04 1.447 0.14792
Sexmale -2.595e+00 2.131e-01 -12.177 < 2e-16 ***
Fare -9.800e-03 5.925e-03 -1.654 0.09815 .
I(Fare^2) 2.798e-05 1.720e-05 1.627 0.10373
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 641.74 on 706 degrees of freedom
(177 observations deleted due to missingness)
AIC: 657.74
Number of Fisher Scoring iterations: 5
Now I'm trying to predict the survival probability of a female aged 21 who paid 35 for her ticket fare.
I'm unable to use predict or predict.glm and am unsure why. I run the following and produce this error:
predict(glmfit, data.frame(PClass=2, Sex="female", Age=20), type="response")
Error in factor(Pclass) : object 'Pclass' not found
I then tried to calculate it the long way, that is, by multiplying my coefficients by the desired values and summing them, but the answer I get there is not right either.
(4.678e+00)+(1*-1.543e+00)+(21*-6.813e-02)+((21^2)*4.620e-04)+(35*-9.800e-03)+((35^2)*2.798e-05)
[1] 1.599287
I'm not sure how I could get a probability greater than 1, especially when my response is a binary factor of 0 or 1.
Could someone please shed some light on my mistakes? Thanks in advance.
If you want to calculate the probability by hand, follow these steps:
1. Multiply the coefficients by the desired values and add them up (this gives the log-odds, not the probability).
2. Take the exponential of the output from step 1 (this gives the odds).
3. Probability = output of step 2/(1 + output of step 2).
In your case, the output of step 1 is 1.599287. The output of step 2 will be exp(1.599287) = 4.949502. Then probability = 4.949502/(1 + 4.949502) = 0.8319187.
So, in R you can create your own function like
logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
For more details, you can visit this.
Otherwise, the suggestion by @Roland should work fine.
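The hand calculation can be checked in base R with plogis(), which is the inverse logit. This is a sketch using the rounded coefficients printed in the question; the vector names are made up for illustration. The Sexmale term is left out because female is the reference level and contributes 0.

```r
# Coefficients copied from the summary, for a second-class female passenger
b <- c(intercept = 4.678e+00, pclass2 = -1.543e+00, age = -6.813e-02,
       age_sq = 4.620e-04, fare = -9.800e-03, fare_sq = 2.798e-05)

# Matching predictor values: Age = 21, Fare = 35
x <- c(1, 1, 21, 21^2, 35, 35^2)

log_odds <- sum(b * x)       # linear predictor on the logit scale
prob     <- plogis(log_odds) # same as exp(log_odds) / (1 + exp(log_odds))
print(c(log_odds = log_odds, prob = prob))

# predict() needs column names that match the formula exactly (Pclass, not
# PClass) and a value for every predictor, including Fare. Assuming the
# fitted object is called glmfit as in the question:
#   predict(glmfit,
#           data.frame(Pclass = 2, Sex = "female", Age = 21, Fare = 35),
#           type = "response")
```

The predict() result will differ slightly from the hand calculation because the printed coefficients are rounded.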
I was able to fit the glm model below, but when I try to run the Hosmer-Lemeshow goodness-of-fit test on it I get an error that I don't understand:
Call:
glm(formula = BC.result ~ Diabetic + Low.diastolic + Pulse, family =
"binomial", data = Data2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0805 -0.4559 -0.3144 -0.2437 2.8259
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.86311 0.98715 -5.939 2.86e-09 ***
Diabetic 1.21963 0.37395 3.262 0.001108 **
Low.diastolic 1.27095 0.35074 3.624 0.000291 ***
Pulse 0.02361 0.00780 3.027 0.002470 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 276.40 on 485 degrees of freedom
Residual deviance: 249.71 on 482 degrees of freedom
(14 observations deleted due to missingness)
AIC: 257.71
> hl<- hoslem.test(model3$BC.result, fitted(model3, g=10))
Error in model.frame.default(formula = cbind(y0 = 1 - y, y1 = y) ~ cutyhat) :
variable lengths differ (found for 'cutyhat')
Can anyone explain this error with the Hosmer Lemeshow goodness of fit test?
Thanks
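A likely cause, offered as a sketch on simulated data rather than on Data2: a glm object has no component called BC.result, so model3$BC.result is NULL, and g = 10 was passed to fitted() instead of to hoslem.test(). The response actually used in the fit, after rows with missing values are dropped, is stored in the y component, and it has the same length as the fitted values. The data frame d and model fit below are hypothetical stand-ins, and the hoslem.test() call assumes the ResourceSelection package.

```r
set.seed(1)

# Simulated stand-in for Data2, with some missing predictor values
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-1 + d$x))
d$x[1:10] <- NA

fit <- glm(y ~ x, family = binomial, data = d)

# Rows with NA are dropped, so the raw column is longer than the fitted values
print(length(d$y))          # all rows of the data frame
print(length(fitted(fit)))  # rows actually used in the fit
print(length(fit$y))        # response used in the fit: same length as fitted

# Observed response and fitted probabilities of equal length, g passed
# to the test itself (only run if the package is available)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  hl <- ResourceSelection::hoslem.test(fit$y, fitted(fit), g = 10)
  print(hl)
}
```

For the model in the question, the equivalent call would be hoslem.test(model3$y, fitted(model3), g = 10).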
I have fit a glm with the quasibinomial family to some data. The model shows high significance for a number of different factors. However, I was wondering how I can get effect sizes for these different factors. Does anyone have any idea?
The model results:
Call:
glm(formula = freq ~ K + res_dist + K:trail_cost + trail_cost:res_dist:K,
family = "quasibinomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2072 -0.3505 -0.1406 0.2714 1.7746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.222e+00 1.786e-01 6.842 7.24e-11 ***
K -1.741e-05 2.107e-06 -8.265 1.19e-14 ***
res_distrandom -1.419e+00 2.386e-01 -5.949 1.02e-08 ***
K:trail_cost 1.381e-01 3.930e-02 3.515 0.000531 ***
K:res_distrandom:trail_cost 1.564e-01 2.532e-02 6.176 3.03e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 0.2932451)
Null deviance: 126.028 on 230 degrees of freedom
Residual deviance: 66.616 on 226 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
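One common way to express effect sizes for a logit-link model is as odds ratios with confidence intervals. The sketch below does this for res_distrandom from the printed estimate and standard error. It assumes a Wald interval based on the t distribution with the residual degrees of freedom is acceptable for a quasi family. Because the model has interaction terms, this odds ratio only describes the random vs. reference contrast where K:trail_cost is zero.

```r
# Estimate and standard error for res_distrandom, copied from the summary
est      <- -1.419
se       <- 0.2386
df_resid <- 226   # residual degrees of freedom from the summary

# Odds ratio and Wald 95% interval, back-transformed from the logit scale
crit       <- qt(0.975, df_resid)
odds_ratio <- exp(est)
ci         <- exp(est + c(-1, 1) * crit * se)
print(round(c(odds_ratio = odds_ratio, lower = ci[1], upper = ci[2]), 3))

# With the fitted object available (called fit here as a placeholder),
# all terms can be handled at once:
#   exp(cbind(OR = coef(fit), confint(fit)))
```

For terms involved in interactions, predicted probabilities at a few chosen values of K and trail_cost are usually easier to interpret than a single odds ratio.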
The full model for growth of plants is as follows:
lmer(log(growth) ~ nutrition + fertilizer + season + (1|block))
where nutrition(nitrogen/phosphorus), fertilizer(none/added), season(dry/wet)
The summary of the model is as follows:
REML criterion at convergence: 71.9
Scaled residuals:
Min 1Q Median 3Q Max
-1.82579 -0.59620 0.04897 0.62629 1.54639
Random effects:
Groups Name Variance Std.Dev.
block (Intercept) 0.06008 0.2451
Residual 0.48633 0.6974
Number of obs: 32, groups: tank, 16
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.5522 0.2684 19.6610 13.233 3.02e-11 ***
nutritionP 0.2871 0.2753 13.0000 1.043 0.31601
fertlizeradded -0.3513 0.2753 13.0000 -1.276 0.22436
seasonwet 1.0026 0.2466 15.0000 4.066 0.00101 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Plant growth here depends only on season, and the increase in growth is 1.0026 on the log scale. How do I interpret this on the scale of the original data, if I want to know what the increase in actual plant height was? Is it simply exp(1.0026) ~ 3 cm, or is there another way to interpret this?
exp(1.0026) is about 2.73 (so roughly 3), but this value is a ratio, not a difference in cm: it represents proportional change. Growth in the wet season is about 2.7 times that in the dry season, all other things being equal.
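A small sketch of what that means on the original scale, using only the fixed-effect estimates printed in the summary. It assumes the intercept corresponds to the reference levels (dry season, no added fertilizer, the non-P nutrition level) and ignores the random block effect. The back-transformed values are geometric means rather than arithmetic means.

```r
b0    <- 3.5522   # intercept on the log scale (reference levels)
b_wet <- 1.0026   # seasonwet on the log scale

ratio <- exp(b_wet)          # multiplicative change, wet vs dry
pct   <- (ratio - 1) * 100   # the same change expressed as a percentage

# Back-transformed growth at the reference levels, dry vs wet
dry <- exp(b0)
wet <- exp(b0 + b_wet)

print(round(c(ratio = ratio, pct = pct, dry = dry, wet = wet,
              difference = wet - dry), 2))
```

The absolute difference (wet - dry) depends on the baseline, so it changes with the nutrition and fertilizer levels, while the ratio stays the same.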