Model outcome = mortality (count), exposure = climate (continuous) in R / RStudio

I have run a Poisson regression model with quasi-Poisson errors in RStudio:
glm(formula = MI ~ corr_data$Temperature + corr_data$Humidity +
corr_data$Sun + corr_data$Rain, family = quasipoisson(),
data = corr_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5323 -1.1149 -0.1346 0.8591 3.2585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9494713 1.2068332 3.273 0.00144 **
corr_data$Temperature -0.0281248 0.0144238 -1.950 0.05381 .
corr_data$Humidity -0.0099800 0.0144047 -0.693 0.48992
corr_data$Sun -0.0767811 0.0414464 -1.853 0.06670 .
corr_data$Rain -0.0003076 0.0004211 -0.731 0.46662
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 1.873611)
Null deviance: 249.16 on 111 degrees of freedom
Residual deviance: 206.36 on 107 degrees of freedom
(24 observations deleted due to missingness)
I have read that the dispersion parameter should ideally be close to 1.
I also have some zero values in my cumulative rainfall measures.
How do I best go about finding the appropriate model?
I next tried a negative binomial model:
Call:
glm.nb(formula = Incidence ~ Humidity + Sun + Rain, data = corr_data,
init.theta = 22.16822882, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.53274 -0.85380 -0.08705 0.73230 2.48643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3626266 1.0970701 1.242 0.2142
Humidity 0.0111537 0.0124768 0.894 0.3713
Sun -0.0295395 0.0345469 -0.855 0.3925
Rain -0.0006170 0.0003007 -2.052 0.0402 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(22.1682) family taken to be 1)
Null deviance: 120.09 on 111 degrees of freedom
Residual deviance: 113.57 on 108 degrees of freedom
(24 observations deleted due to missingness)
AIC: 578.3
Number of Fisher Scoring iterations: 1
Theta: 22.2
Std. Err.: 11.8
2 x log-likelihood: -568.299
Any advice would be very much appreciated. I am new to R and to modelling!
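A minimal sketch of one way to compare these candidates on the same rows, assuming the corr_data column names from the calls above and using MI as the outcome throughout (the negative binomial call above used Incidence). Fitting on complete cases keeps all three models on identical observations, so their deviances and AICs are directly comparable:
library(MASS)  # glm.nb
vars <- c("MI", "Temperature", "Humidity", "Sun", "Rain")
cc <- corr_data[complete.cases(corr_data[, vars]), ]
fit_pois <- glm(MI ~ Temperature + Humidity + Sun + Rain, family = poisson(), data = cc)
fit_qp <- glm(MI ~ Temperature + Humidity + Sun + Rain, family = quasipoisson(), data = cc)
fit_nb <- glm.nb(MI ~ Temperature + Humidity + Sun + Rain, data = cc)
# Crude overdispersion check for the Poisson fit:
# residual deviance / residual df, ideally near 1
deviance(fit_pois) / df.residual(fit_pois)
# AIC is defined for the Poisson and negative binomial fits
# (quasi-likelihoods have no AIC)
AIC(fit_pois, fit_nb)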

Related

Extracting selected output in R using summary

model <- glm(am ~ disp + hp, data=mtcars, family=binomial)
T1<-summary(model)
T1
This is the output I get:
Call:
glm(formula = am ~ disp + hp, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9665 -0.3090 -0.0017 0.3934 1.3682
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.40342 1.36757 1.026 0.3048
disp -0.09518 0.04800 -1.983 0.0474 *
hp 0.12170 0.06777 1.796 0.0725 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.713 on 29 degrees of freedom
AIC: 22.713
Number of Fisher Scoring iterations: 8
I want to extract only the coefficients and the null deviance, as shown below. How do I do it? I tried using $coefficients, but it only shows the coefficient values.
Coefficients:
(Intercept) disp hp
1.40342203 -0.09517972 0.12170173
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.713 on 29 degrees of freedom
AIC: 22.713
Update:
Try this:
coef(model)          # named vector of coefficient estimates
model$coefficients   # the same vector, accessed as a list element
model$null.deviance  # null deviance
model$deviance       # residual deviance
model$aic            # AIC
If you type T1$, an autocomplete window pops up and you can select whatever you need.
T1$null.deviance
T1$coefficients
> T1$null.deviance
[1] 43.22973
> T1$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.40342203 1.36756660 1.026218 0.30478864
disp -0.09517972 0.04800283 -1.982794 0.04739044
hp 0.12170173 0.06777320 1.795721 0.07253897
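As an aside, the broom package (assuming it is installed) returns the same pieces as data frames, which can be easier to work with than the summary object:
library(broom)
tidy(model)   # coefficient table: estimate, std.error, statistic, p.value
glance(model) # one-row model summary including null.deviance, deviance, and AIC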

Small sample (20-25 observations) - Robust standard errors (Newey-West) do not change coefficients/standard errors. Is this normal?

I am running a simple regression (OLS):
> lm_1 <- lm(Dependent_variable_1 ~ Independent_variable_1, data = data_1)
> summary(lm_1)
Call:
lm(formula = Dependent_variable_1 ~ Independent_variable_1,
data = data_1)
Residuals:
Min 1Q Median 3Q Max
-143187 -34084 -4990 37524 136293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.418 < 2e-16 ***
`GDP YoY% - Base` 3164631 689599 4.589 0.000118 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66160 on 24 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.4674, Adjusted R-squared: 0.4452
F-statistic: 21.06 on 1 and 24 DF, p-value: 0.0001181
The autocorrelation and heteroskedasticity tests follow:
> dwtest(lm_1,alternative="two.sided")
Durbin-Watson test
data: lm_1
DW = 0.93914, p-value = 0.001591
alternative hypothesis: true autocorrelation is not 0
> bptest(lm_1)
studentized Breusch-Pagan test
data: lm_1
BP = 9.261, df = 1, p-value = 0.002341
Then I run a robust regression correcting for autocorrelation and heteroskedasticity (HAC, Newey-West):
> coeftest(lm_1, vocv=NeweyWest(lm_1,lag=2, prewhite=FALSE))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.4185 < 2.2e-16 ***
Independent_variable_1 3164631 689599 4.5891 0.0001181 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and I get the same coefficients and standard errors.
Is this normal? Is this due to the small sample size?
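One thing worth checking before blaming the sample size: the call above spells the argument vocv=, but coeftest() expects vcov.=. A misspelled argument falls into ... and is silently ignored, so the default OLS covariance matrix was used. A corrected sketch, assuming the lmtest and sandwich packages already loaded for the calls above:
# With vcov. spelled correctly, the Newey-West HAC matrix is actually used.
# The coefficient estimates stay identical by construction; only the
# standard errors, t values, and p values should change.
coeftest(lm_1, vcov. = NeweyWest(lm_1, lag = 2, prewhite = FALSE))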

Big-O of Algorithm Using Multiple Variable Regression

For more involved algorithms, determining the time complexity (i.e. Big-O) is a pain. My solution has been to time the execution of the algorithm with parameters n and k, and come up with a function (a time function) that varies with n and k.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function in the stats R package. I don't know how to interpret the output of the multiple regression to determine a final Big-O. This is my main question: how do you translate the output of a multiple-variable regression into a final ruling on the best Big-O time complexity?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
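One way to read the log-log fit: if log(time) ≈ a + b*log(n) + c*log(k), then time ≈ exp(a) * n^b * k^c, so the fitted exponents estimate the polynomial orders directly. A sketch, with timings as an assumed name for the data frame shown above:
fit <- lm(log(executionTime) ~ log(n) + log(k), data = timings)
coef(fit)
# Here the fitted exponents are about 2.02 for log(n) and 1.01 for log(k);
# rounding to the nearest integers suggests O(n^2 * k).
round(coef(fit)[-1])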
Here's the output of the principal components, showing that both n and k contribute to the spread of the multivariate model:
PC1 (this is n) PC2 (this is k) PC3 (noise?)
Standard deviation 1.3654 1.0000 0.36840
Proportion of Variance 0.6214 0.3333 0.04524
Cumulative Proportion 0.6214 0.9548 1.00000

Getting around predictors being stacked with the intercept

My factor "Hours" is a categorical predictor and has values 1 and 2. When I applied as.factor, I think the category for value 1 got stacked with the intercept. Is there a way to keep that stacking from happening?
Call:
glm(formula = Appointment.Status ~ as.factor(Hours), family = binomial,
data = data_appt)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5593 -0.5593 -0.5593 -0.4781 2.1098
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.11132 0.04523 -46.681 < 2e-16 ***
as.factor(Hours)2 0.33508 0.05435 6.166 7.02e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10871 on 13970 degrees of freedom
Residual deviance: 10832 on 13969 degrees of freedom
AIC: 10836
Number of Fisher Scoring iterations: 4
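If the goal is one coefficient per level rather than a baseline-plus-contrast coding, a sketch that drops the intercept (the fitted model is identical; only the parameterization changes):
# "0 +" (equivalently "- 1") removes the intercept, so each level of Hours
# gets its own coefficient: the log-odds for that level directly.
fit <- glm(Appointment.Status ~ 0 + as.factor(Hours), family = binomial, data = data_appt)
summary(fit)  # coefficients named as.factor(Hours)1 and as.factor(Hours)2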

How to pull out the dispersion parameter in R

Call:
glm(formula = Y1 ~ 0 + x1 + x2 + x3 + x4 + x5, family = quasibinomial(link = cauchit))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5415 0.2132 0.3988 0.6614 1.8426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.7280 0.3509 -2.075 0.03884 *
x2 -0.9108 0.3491 -2.609 0.00951 **
x3 0.2377 0.1592 1.494 0.13629
x4 -0.2106 0.1573 -1.339 0.18151
x5 3.6982 0.8658 4.271 2.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 0.8782731)
Null deviance: 443.61 on 320 degrees of freedom
Residual deviance: 270.17 on 315 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 12
Here is the output from glm in R.
Do you know a way to pull out the dispersion parameter, which is 0.8782731 in this case, rather than just copying and pasting? Thanks.
You can extract it from the output of summary:
data(iris)
mod <- glm((Petal.Length > 5) ~ Sepal.Width, data=iris)
summary(mod)
#
# Call:
# glm(formula = (Petal.Length > 5) ~ Sepal.Width, data = iris)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.3176 -0.2856 -0.2714 0.7073 0.7464
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.38887 0.26220 1.483 0.140
# Sepal.Width -0.03561 0.08491 -0.419 0.676
#
# (Dispersion parameter for gaussian family taken to be 0.2040818)
#
# Null deviance: 30.240 on 149 degrees of freedom
# Residual deviance: 30.204 on 148 degrees of freedom
# AIC: 191.28
#
# Number of Fisher Scoring iterations: 2
summary(mod)$dispersion
# [1] 0.2040818
The str function in R is often helpful for answering these sorts of questions. For instance, I looked at str(summary(mod)) to answer this one.
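The same slot should work for the quasibinomial fit in the question (fit_qb is an assumed name for that model object):
summary(fit_qb)$dispersion
# [1] 0.8782731, matching the value printed in the question's summary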
