What is the difference between binomial, binomial() and "binomial" when used as the family argument to glm? They are not identical, as the following code shows:
> library(MASS)
> bwdf = birthwt[-10]
> mod = glm(low~., data=bwdf, family=binomial)
> mod2 = glm(low~., data=bwdf, family=binomial())
> mod3 = glm(low~., data=bwdf, family="binomial")
> identical(mod, mod2)
[1] FALSE
> identical(mod3, mod2)
[1] FALSE
> identical(mod3, mod)
[1] FALSE
But the fitted models look identical:
> mod
Call: glm(formula = low ~ ., family = binomial, data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod2
Call: glm(formula = low ~ ., family = binomial(), data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
>
> mod3
Call: glm(formula = low ~ ., family = "binomial", data = bwdf)
Coefficients:
(Intercept) age lwt race2 race3 smoke1 ptl ht1 ui1 ftv
0.48062 -0.02955 -0.01542 1.27226 0.88050 0.93885 0.54334 1.86330 0.76765 0.06530
Degrees of Freedom: 188 Total (i.e. Null); 179 Residual
Null Deviance: 234.7
Residual Deviance: 201.3 AIC: 221.3
Is there any difference?
Remember that identical() is very picky, and that part of each model object is the call used to create it. The call component differs because of the parentheses and quotes, so identical() reports that the objects differ. Try calling identical() on the pieces of the model objects you actually care about and see whether those are identical.
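A sketch of the idea, comparing only the components that matter:
identical(coef(mod), coef(mod2))      # TRUE
identical(fitted(mod), fitted(mod3))  # TRUE
identical(mod$call, mod2$call)        # FALSE: only the recorded calls differ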
If you look at the first few lines of the code of glm, you will see that it checks the family argument: if it is a character string, it uses get to fetch the function of that name; if family is a function (either passed in directly or the result of get), it calls that function. So whether you pass the name as a character string, the function itself, or the result of evaluating the function, after this first part of the code family holds exactly the same thing, and you therefore get the same results (only the recorded call differs).
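The relevant check, paraphrased from the first few lines of stats::glm:
if (is.character(family))
    family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family))
    family <- family()
if (is.null(family$family))
    stop("'family' not recognized")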
Related
I am working on propensity score weighting and I want to use logistic regression to estimate the ATEs.
There are 3 covariates (Institution.Cluster, Education and SuicidalAttempt_yn) and 2 variables (clozapine.or.not and death_yn).
The Institution.Cluster data are strings, e.g. AKE, ABE, AIE, AOE
For Education: Primary, Secondary, Tertiary
Suicidal Attempt: 1 or 0
Clozapine or not: 1 or 0
death_yn: 1 or 0
Since I want to compare the mortality of the clozapine and control groups, I am not sure whether I should run the analysis separately or combined.
For now, I have tried keeping the clozapine and control groups together, and used group_by(clozapine.or.not) to generate per-group results:
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = df_all,
                 family = binomial(logit)))
models$model
The output:
[[1]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
[[2]]
Call: glm(formula = death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
family = binomial(logit), data = df_all)
Coefficients:
(Intercept) EducationLess than Primary EducationPrimary EducationSecondary EducationTertiary or above EducationTertiary or above
-0.99253 -0.02816 -0.40882 -1.21096 -1.61048 -11.58372
EducationUnknown Institution.ClusterHKW Institution.ClusterKC Institution.ClusterKE Institution.ClusterKW Institution.ClusterNTE
-0.52788 0.07147 0.11661 -0.18822 0.41416 -0.11384
Institution.ClusterNTW Institution.ClusterUnknown SuicidalAttempt_yn1
0.06603 -1.78597 0.14611
Degrees of Freedom: 6220 Total (i.e. Null); 6206 Residual
(83 observations deleted due to missingness)
Null Deviance: 5227
Residual Deviance: 5031 AIC: 5061
Does this mean the observations have already been split into the clozapine and control groups?
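Note that the two models printed above are coefficient-for-coefficient identical, which suggests the grouping was never applied: inside do(), the current group's rows are referred to by the pronoun ., whereas data = df_all fits every model on the full data set. A sketch of the intended per-group fit:
models <- df_all %>%
  group_by(clozapine.or.not) %>%
  do(model = glm(death_yn ~ Education + Institution.Cluster + SuicidalAttempt_yn,
                 data = .,  # "." is the current group's data, not df_all
                 family = binomial(logit)))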
In the following example
df <- data.frame(place = c("South","South","North"),
temperature = c(30,30,20),
outlookfine=c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ ., family = binomial, data = df)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is temperature NA?
[Update]
I experimented with more data
df <- data.frame(place = c("South","South","North","East","West"),
temperature = c(30,17,20,12,15),
outlookfine=c(TRUE,TRUE,FALSE,FALSE,TRUE)
)
glm.fit <- glm(outlookfine ~ ., family = binomial, data = df)
glm.fit
This time every coefficient was estimated:
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeNorth placeSouth placeWest temperature
-2.457e+01 -7.094e-07 4.913e+01 4.913e+01 8.868e-08
Degrees of Freedom: 4 Total (i.e. Null); 0 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 10
I think it is because place is highly correlated with temperature.
Right: in your first example temperature is perfectly collinear with place (both South rows are 30 and the North row is 20), so the model matrix is rank-deficient and glm reports the aliased coefficient as NA. You'll get the same fitted(glm.fit) values if you do either
glm.fit <- glm(outlookfine ~ place, df, family = binomial)
or
glm.fit <- glm(outlookfine ~ temperature, df, family = binomial)
Another example, with perfectly collinear variables giving NA coefficients:
df <- iris
df$SL <- df$Sepal.Length * 2 + 1
glm(Sepal.Width ~ Sepal.Length + SL, data = df)
Call: glm(formula = Sepal.Width ~ Sepal.Length + SL, data = df)
Coefficients:
(Intercept) Sepal.Length SL
3.41895 -0.06188 NA
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 28.31
Residual Deviance: 27.92 AIC: 179.5
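One way to see the rank deficiency directly is to inspect the design matrix (a sketch using the iris example above):
X <- model.matrix(Sepal.Width ~ Sepal.Length + SL, data = df)
ncol(X)     # 3 columns in the design matrix
qr(X)$rank  # rank 2: SL = 2 * Sepal.Length + 1 adds no new information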
When I run the following
df <- data.frame(place = c("South","South","North"),
temperature = c(30,30,20),
outlookfine=c(TRUE,TRUE,FALSE)
)
glm.fit <- glm(outlookfine ~ ., family = binomial, data = df)
glm.fit
The output is
Call: glm(formula = outlookfine ~ ., family = binomial, data = df)
Coefficients:
(Intercept) placeSouth temperature
-23.57 47.13 NA
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 3.819
Residual Deviance: 3.496e-10 AIC: 4
Why is North missing?
[Update]
I added "East" and now North appears.
How does R choose which level is the base case?
I am checking out the docs.
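For reference, a sketch of the default behaviour (see ?factor, ?relevel and ?contr.treatment): factor levels are sorted alphabetically unless set explicitly, and treatment coding uses the first level as the base case, absorbing it into the intercept.
levels(factor(c("South", "South", "North")))          # "North" "South": North is the base
levels(factor(c("South", "South", "North", "East")))  # "East" now sorts first and becomes the base
df$place <- relevel(factor(df$place), ref = "South")  # choose a different base level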
I'm running glm(y ~ x, family = poisson(link = log)). I can't understand the difference between residuals(XX) and XX$residuals. I'd like to know where residuals(XX) and XX$residuals come from, and how each relates to the deviance. Please give me some advice.
x<-c(1,2,3,4)
y<-c(2,3,7,6)
r<-glm(y~x,family=poisson(link="log"))
#Call: glm(formula = y ~ x, family = poisson(link = "log"))
#Coefficients:
#(Intercept) x
# 0.4978 0.3691
#Degrees of Freedom: 3 Total (i.e. Null); 2 Residual
#Null Deviance: 3.961
#Residual Deviance: 1.064 AIC: 18.13
deviance(r)# [1] 1.063829
sum(residuals(r)^2)# [1] 1.063829
residuals(r)# -0.2530074 -0.2434844 0.8533358 -0.4608144
sum(r$residuals^2)# [1] 0.234646
r$residuals# -0.1594713 -0.1283418 0.4061298 -0.1667391
Thanks. I got it.
residuals(r) = residuals(r, type="deviance")
residuals(r)# -0.2530074 -0.2434844 0.8533358 -0.4608144
residuals(r,type="deviance")# -0.2530074 -0.2434844 0.8533358 -0.4608144
METHOD
mu <- r$fitted  # partial-matches r$fitted.values: 2.379455 3.441716 4.978203 7.200626
poisson.dev <- function(y, mu) 2 * (y * log(ifelse(y == 0, 1, y / mu)) - (y - mu))
sqrt(poisson.dev(y, mu)) * ifelse(mu < y, 1, -1)  # -0.2530074 -0.2434844 0.8533358 -0.4608144
RELATIONSHIP: deviance(r) = sum(residuals(r)^2) = sum(residuals(r, type="deviance")^2)
deviance(r)# [1] 1.063829
sum(residuals(r)^2)# [1] 1.063829
r$residuals = residuals(r, type="working")
r$residuals#-0.1594713 -0.1283418 0.4061298 -0.1667391
residuals(r,type="working")#-0.1594713 -0.1283418 0.4061298 -0.1667391
(y-r$fitted)/r$fitted#-0.1594713 -0.1283418 0.4061298 -0.1667391
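For completeness, residuals.glm supports further types; a quick check of the Pearson and response residuals (mu as defined above):
residuals(r, type = "pearson")   # (y - mu) / sqrt(mu) for the Poisson family
residuals(r, type = "response")  # y - mu
(y - mu) / sqrt(mu)              # reproduces the Pearson residuals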
I am running a logistic regression and I notice that each unique character string in my vector receives its own parameter. Is R optimizing the prediction of the outcome variable based on each collection of unique values within the vector?
library(stats)
df = as.data.frame(matrix(c("a","a","b","c","c","b","a","a","b","b","c",
                            1,0,0,0,1,0,1,1,0,1,0,
                            1,0,100,10,8,3,5,6,13,10,4,
                            "SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"),
                          ncol = 4))
colnames(df) = c("letter","number1","number2","city")
df$letter = as.factor(df$letter)
df$city = as.factor(df$city)
df$number1 = as.numeric(df$number1)
df$number2 = as.numeric(df$number2)
glm(number1 ~ .,data=df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept) letterb letterc number2 cityNY citySF
#1.57191 -0.25227 -0.01424 0.04593 -0.69269 -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city in the example above?
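Two things are going on here (a sketch; model.matrix shows the design matrix glm actually uses). First, with no family argument glm() defaults to gaussian, so the fit above is ordinary least squares rather than a logit; the handling of factors is the same either way. Second, each factor enters through treatment (dummy) coding, one 0/1 column per non-reference level:
model.matrix(number1 ~ ., data = df)
# "CHI" sorts first, so cityCHI is the reference level absorbed into the
# intercept; cityNY and citySF each get their own 0/1 column and coefficient.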