Getting around with predictors stacked with the intercept - r

My predictor "Hours" is categorical and takes the values 1 and 2. After I applied as.factor, the category for value 1 seems to be absorbed into the intercept. Is there a way to stop that from happening?
Call:
glm(formula = Appointment.Status ~ as.factor(Hours), family = binomial,
data = data_appt)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5593 -0.5593 -0.5593 -0.4781 2.1098
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.11132 0.04523 -46.681 < 2e-16 ***
as.factor(Hours)2 0.33508 0.05435 6.166 7.02e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10871 on 13970 degrees of freedom
Residual deviance: 10832 on 13969 degrees of freedom
AIC: 10836
Number of Fisher Scoring iterations: 4
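In the output above, the intercept (-2.11132) is the log-odds for Hours = 1, and the log-odds for Hours = 2 is -2.11132 + 0.33508 = -1.77624. One way to get a separate coefficient for each level is to drop the intercept from the formula. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for data_appt (am as the binary outcome and vs as the two-level predictor are illustrative choices, not from the question):

```r
# Stand-in data: mtcars ships with R. `am` is a 0/1 outcome and `vs` is a
# two-level predictor, playing the roles of Appointment.Status and Hours.

# Default treatment coding: the first level is the reference (the intercept).
fit_ref <- glm(am ~ factor(vs), family = binomial, data = mtcars)

# "0 +" (or "- 1") removes the intercept, so each level gets its own coefficient.
fit_cells <- glm(am ~ 0 + factor(vs), family = binomial, data = mtcars)

coef(fit_ref)    # (Intercept), factor(vs)1
coef(fit_cells)  # factor(vs)0, factor(vs)1

# Same model, different parameterisation:
# level-1 log-odds = intercept + difference from the reference fit.
sum(coef(fit_ref))
```

Both fits give the same fitted probabilities. Note that without an intercept each z test compares a level's log-odds with 0 (a probability of 0.5) rather than with the other level, so the p-values answer a different question.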

Related

Extracting selected output in R using summary

model <- glm(am ~ disp + hp, data=mtcars, family=binomial)
T1<-summary(model)
T1
This is the output I get:
Call:
glm(formula = am ~ disp + hp, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9665 -0.3090 -0.0017 0.3934 1.3682
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.40342 1.36757 1.026 0.3048
disp -0.09518 0.04800 -1.983 0.0474 *
hp 0.12170 0.06777 1.796 0.0725 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.713 on 29 degrees of freedom
AIC: 22.713
Number of Fisher Scoring iterations: 8
I want to extract only the coefficients and the deviance lines, as shown below. How do I do it? I tried using $coefficients, but it only shows the coefficient values.
Coefficients:
(Intercept) disp hp
1.40342203 -0.09517972 0.12170173
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.713 on 29 degrees of freedom
AIC: 22.713
Update:
Try this:
coef(model)
model$coefficients
model$null.deviance
model$deviance
model$aic
If you type T1$, a completion window appears and you can select whichever component you need.
T1$null.deviance
T1$coefficients
> T1$null.deviance
[1] 43.22973
> T1$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.40342203 1.36756660 1.026218 0.30478864
disp -0.09517972 0.04800283 -1.982794 0.04739044
hp 0.12170173 0.06777320 1.795721 0.07253897
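The pieces above can be collected into a single object. A small sketch, using the same mtcars model as the question; the list name wanted is only illustrative:

```r
model <- glm(am ~ disp + hp, data = mtcars, family = binomial)
s <- summary(model)

# Gather only the requested components into one named list.
wanted <- list(
  coefficients  = coef(model),      # estimates only, as a named vector
  null.deviance = s$null.deviance,  # deviance of the intercept-only model
  df.null       = s$df.null,        # its degrees of freedom
  deviance      = s$deviance,       # residual deviance
  df.residual   = s$df.residual,    # its degrees of freedom
  aic           = s$aic
)

str(wanted)
```

names(s) lists every component that summary() stores, which is a quick way to see what else can be pulled out.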

Small sample (20-25 observations) - Robust standard errors (Newey-West) do not change coefficients/standard errors. Is this normal?

I am running a simple regression (OLS)
> lm_1 <- lm(Dependent_variable_1 ~ Independent_variable_1, data = data_1)
> summary(lm_1)
Call:
lm(formula = Dependent_variable_1 ~ Independent_variable_1,
data = data_1)
Residuals:
Min 1Q Median 3Q Max
-143187 -34084 -4990 37524 136293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.418 < 2e-16 ***
`GDP YoY% - Base` 3164631 689599 4.589 0.000118 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66160 on 24 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.4674, Adjusted R-squared: 0.4452
F-statistic: 21.06 on 1 and 24 DF, p-value: 0.0001181
The autocorrelation and heteroskedasticity tests follow:
> dwtest(lm_1,alternative="two.sided")
Durbin-Watson test
data: lm_1
DW = 0.93914, p-value = 0.001591
alternative hypothesis: true autocorrelation is not 0
> bptest(lm_1)
studentized Breusch-Pagan test
data: lm_1
BP = 9.261, df = 1, p-value = 0.002341
Then I compute standard errors that are robust to autocorrelation and heteroskedasticity (HAC, Newey-West):
> coeftest(lm_1, vocv=NeweyWest(lm_1,lag=2, prewhite=FALSE))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.4185 < 2.2e-16 ***
Independent_variable_1 3164631 689599 4.5891 0.0001181 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and I get the same results for coefficients / standard errors.
Is this normal? Is this due to the small sample size?
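One thing worth checking before blaming the sample size: the call above spells the argument vocv, whereas coeftest() takes vcov. (which vcov = also matches). A misspelled name is most likely absorbed by the ... argument and ignored, which would explain output identical to the plain OLS summary. The sketch below illustrates this on simulated data (the variables x, y and the AR(1) errors are made up for the illustration) and assumes the lmtest and sandwich packages are installed:

```r
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()

# Simulated stand-in data with autocorrelated errors (not the poster's data).
set.seed(1)
n <- 26
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.6), n = n))
y <- 1 + 2 * x + e
fit <- lm(y ~ x)

plain <- coeftest(fit)                                                  # ordinary OLS standard errors
typo  <- coeftest(fit, vocv = NeweyWest(fit, lag = 2, prewhite = FALSE))   # misspelled argument
hac   <- coeftest(fit, vcov. = NeweyWest(fit, lag = 2, prewhite = FALSE))  # correctly named argument

# Compare the standard errors side by side.
cbind(plain = plain[, "Std. Error"],
      typo  = typo[, "Std. Error"],
      hac   = hac[, "Std. Error"])
```

The coefficient estimates are the same in all three cases by construction; a HAC covariance matrix only changes the standard errors, t values and p-values.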

Model outcome = mortality (count), exposure = climate (continuous), Rstudio

I have run a Poisson model with quasi-Poisson errors in RStudio:
glm(formula = MI ~ corr_data$Temperature + corr_data$Humidity +
corr_data$Sun + corr_data$Rain, family = quasipoisson(),
data = corr_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5323 -1.1149 -0.1346 0.8591 3.2585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9494713 1.2068332 3.273 0.00144 **
corr_data$Temperature -0.0281248 0.0144238 -1.950 0.05381 .
corr_data$Humidity -0.0099800 0.0144047 -0.693 0.48992
corr_data$Sun -0.0767811 0.0414464 -1.853 0.06670 .
corr_data$Rain -0.0003076 0.0004211 -0.731 0.46662
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 1.873611)
Null deviance: 249.16 on 111 degrees of freedom
Residual deviance: 206.36 on 107 degrees of freedom
(24 observations deleted due to missingness)
I have read that the dispersion parameter should ideally be close to 1.
I have some zero values in my cumulative rainfall measures.
How best do I go about finding the appropriate model?
I next tried a negative binomial model:
Call:
glm.nb(formula = Incidence ~ Humidity + Sun + Rain, data = corr_data,
init.theta = 22.16822882, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.53274 -0.85380 -0.08705 0.73230 2.48643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3626266 1.0970701 1.242 0.2142
Humidity 0.0111537 0.0124768 0.894 0.3713
Sun -0.0295395 0.0345469 -0.855 0.3925
Rain -0.0006170 0.0003007 -2.052 0.0402 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(22.1682) family taken to be 1)
Null deviance: 120.09 on 111 degrees of freedom
Residual deviance: 113.57 on 108 degrees of freedom
(24 observations deleted due to missingness)
AIC: 578.3
Number of Fisher Scoring iterations: 1
Theta: 22.2
Std. Err.: 11.8
2 x log-likelihood: -568.299
Any advice would be very much appreciated. I am new to R and to modelling!
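For reference, the dispersion parameter reported by the quasi-Poisson fit is the Pearson chi-square statistic divided by the residual degrees of freedom, so it can be checked directly from a plain Poisson fit. A minimal sketch, assuming the built-in warpbreaks data as a stand-in for corr_data (the variables breaks, wool and tension are not from the question):

```r
# Stand-in count data shipped with R.
fit_pois  <- glm(breaks ~ wool + tension, family = poisson,      data = warpbreaks)
fit_quasi <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)

# Pearson chi-square divided by residual degrees of freedom.
dispersion_manual <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

dispersion_manual
summary(fit_quasi)$dispersion  # the value printed as "Dispersion parameter"

# The point estimates are identical; quasi-Poisson only rescales the
# standard errors by sqrt(dispersion).
cbind(poisson = coef(fit_pois), quasipoisson = coef(fit_quasi))
```

A value well above 1 indicates overdispersion relative to the Poisson assumption; quasi-Poisson and negative binomial models are two common ways of allowing for it.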

How does lm() know which predictors are categorical?

Normally, you and I (assuming you're not a bot) can easily tell whether a predictor is categorical or quantitative. Gender, for example, is obviously categorical. Your last vote can also be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we feed some data into R and its lm function creates dummy variables for a predictor? How does it do that?
Somewhat related Question on StackOverflow.
Look up R's factor function. Here is a small demo: the first model uses the number of cylinders as a numerical variable, and the second uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
Hxd1011 addressed the more difficult case, where a categorical variable is stored as a number, so R treats it as numerical by default; if that is not the desired behaviour, we must use the factor function.
Your example with the predictor ShelveLoc in the Carseats dataset is easier because its values are text labels (stored as a factor, as the Levels line below shows), and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium
R decides this from the variable's type. You can check the types with str(dataset). If a variable is a factor, R creates dummy variables for it.
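The dummy variables themselves can be inspected with model.matrix(), which builds the design matrix that lm() fits. A short sketch using mtcars, as in the demo above (the names cyl_f, X_num and X_fac are illustrative):

```r
# cyl is stored as a number, so a formula treats it as one numeric column.
str(mtcars$cyl)
X_num <- model.matrix(~ cyl, data = mtcars)
colnames(X_num)  # "(Intercept)" "cyl"

# Converted to a factor, it expands into one dummy column per non-reference level.
cyl_f <- factor(mtcars$cyl)
levels(cyl_f)    # "4" "6" "8"; the first level is the reference
X_fac <- model.matrix(~ cyl_f)
colnames(X_fac)  # "(Intercept)" "cyl_f6" "cyl_f8"

head(X_fac)
```

The first factor level gets no column of its own because it is represented by the intercept, which is why the summary above lists only factor(cyl)6 and factor(cyl)8.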

BigO of Algorithm Using Multiple Variable Regression

For more verbose algorithms, determining the time complexity (i.e. BigO) is a pain. My solution has been to time the execution of the algorithm with parameters n and k, and come up with a function (time function) that varies with n and k.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function in the stats R package. I don't know how to interpret the output of the multiple regression, to determine a final Big-O. This is my main question: how do you translate the output of a multiple variable regression, to a final ruling on the best Big-O time complexity rating?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of the principal components analysis, showing that both n and k contribute to the spread of the multivariate model:
PC1(This is n) PC2 (this is k) PC3 (noise?)
Standard deviation 1.3654 1.0000 0.36840
Proportion of Variance 0.6214 0.3333 0.04524
Cumulative Proportion 0.6214 0.9548 1.00000
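One possible reading of the log-log fit above: if time is roughly c * n^a * k^b, then log(time) = log(c) + a*log(n) + b*log(k), so the slopes estimate the exponents. Slopes of about 2.02 and 1.01 would then suggest growth on the order of n^2 * k over the measured range, which is empirical evidence rather than a proof of the complexity. The sketch below checks the idea on simulated timings; the constant 5e-8, the noise level and the grid are assumptions chosen for illustration, not the original measurements:

```r
# Simulated timings from a known n^2 * k law with multiplicative noise.
set.seed(42)
grid <- expand.grid(n = seq(500, 10000, by = 500), k = 1:10)
grid$time <- 5e-8 * grid$n^2 * grid$k * exp(rnorm(nrow(grid), sd = 0.15))

# Fit the power law on the log scale; the slopes are the exponents.
fit <- lm(log(time) ~ log(n) + log(k), data = grid)
exps <- coef(fit)[c("log(n)", "log(k)")]
round(exps, 2)

# Confidence intervals show which integer exponents are plausible.
confint(fit)
```

The untransformed model time ~ n + k can only describe growth that is linear in each variable, which may be why its R-squared (0.747) is so much lower than that of the log-log model (0.9897).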
