Apologies for any bad English; it is not my first language :)
So I have a dataset of the passengers of the Titanic, and I produced the following fit summary:
glm(formula = Survived ~ factor(Pclass) + Age + I(Age^2) + Sex +
Fare + I(Fare^2), family = binomial(), data = titan)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7298 -0.6738 -0.3769 0.6291 2.4821
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.678e+00 6.321e-01 7.401 1.35e-13 ***
factor(Pclass)2 -1.543e+00 3.525e-01 -4.377 1.20e-05 ***
factor(Pclass)3 -2.909e+00 3.882e-01 -7.494 6.69e-14 ***
Age -6.813e-02 2.196e-02 -3.102 0.00192 **
I(Age^2) 4.620e-04 3.193e-04 1.447 0.14792
Sexmale -2.595e+00 2.131e-01 -12.177 < 2e-16 ***
Fare -9.800e-03 5.925e-03 -1.654 0.09815 .
I(Fare^2) 2.798e-05 1.720e-05 1.627 0.10373
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 641.74 on 706 degrees of freedom
(177 observations deleted due to missingness)
AIC: 657.74
Number of Fisher Scoring iterations: 5
Now I'm trying to predict the survival probability of a female aged 21 who paid 35 for her ticket fare.
I'm unable to use predict or predict.glm and am unsure why. I run the following and produce this error:
predict(glmfit, data.frame(PClass=2, Sex="female", Age=20), type="response")
Error in factor(Pclass) : object 'Pclass' not found
I then try to calculate it the long way, that is, by multiplying my coefficients by the desired values, but the answer I get there is not right either.
(4.678e+00)+(1*-1.543e+00)+(21*-6.813e-02)+((21^2)*4.620e-04)+(35*-9.800e-03)+((35^2)*2.798e-05)
[1] 1.599287
Not sure how I could get a probability greater than 1, especially when my response is a binomial factor of 0 or 1.
Could someone please shed some light on my mistakes? Thanks in advance.
If you want to calculate the probability by hand, follow these steps:
1. Multiply the coefficients by the desired values and sum them (this gives the logit, i.e., the log-odds).
2. Take the exponential of the output from step 1 (this gives the odds).
3. Probability = output of step 2 / (1 + output of step 2).
In your case, the output of step 1 is 1.599287. The output of step 2 is exp(1.599287) = 4.949502. Then the probability = 4.949502/(1 + 4.949502) = 0.8319187.
So, in R you can create your own function like
logit2prob <- function(logit){
  odds <- exp(logit)         # step 2: log-odds to odds
  prob <- odds / (1 + odds)  # step 3: odds to probability
  return(prob)
}
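For example, with the logit computed in the question:
logit2prob(1.599287)
# [1] 0.8319187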
Otherwise, the suggestion by @Roland should work fine.
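In any case, the predict() error in the question comes from the newdata argument: it must contain every variable used in the model formula, spelled exactly as in the formula (Pclass, not PClass), and Fare is missing as well. A sketch of a corrected call, assuming the fitted model is stored as glmfit:
predict(glmfit,
        newdata = data.frame(Pclass = 2, Sex = "female", Age = 21, Fare = 35),
        type = "response")
With type = "response", predict() applies the inverse-logit for you, so the result is already a probability.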
I am learning how to use GLMs to test hypotheses and to see how variables relate to one another.
I am trying to see if the variable tick prevalence (parasitized individuals / assessed individuals) (the dependent variable) is influenced by the number of captured hosts (the independent variable).
My data look like figure 1 (116 observations).
I have read that one way to know which distribution to use is to see which distribution the dependent variable follows, so I built a histogram of the TickPrev variable (figure 2).
I concluded that the negative binomial distribution would be the best option. Before I ran the analysis, I transformed the TickPrev variable (it was a proportion, and glm.nb only works with integers) by applying the following code:
library(dplyr)
df <- df %>% mutate(TickPrev = TickPrev * 100)  # scale the proportion up to a percentage
df$TickPrev <- as.integer(df$TickPrev)          # glm.nb only accepts integer counts
Then I applied the glm.nb function from the MASS package, and obtained this summary
summary(glm.nb(df$TickPrev~df$Captures, link=log))
Call:
glm.nb(formula = df$TickPrev ~ df$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates that there isn't enough evidence to conclude that the two variables are related. However, I am not sure if I used the best model to fit the data, or how I can know that. Can you please help me? Also, given what I have shown, is there a better way to see if these variables are related?
Thank you very much.
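One way to check whether the negative binomial actually improves on a plain Poisson fit is to fit both and compare AICs. A minimal sketch, reusing the transformed df from above:
library(MASS)
# Poisson and negative binomial fits of the same mean model
fit_pois <- glm(TickPrev ~ Captures, family = poisson(link = "log"), data = df)
fit_nb <- glm.nb(TickPrev ~ Captures, link = log, data = df)
# A substantially lower AIC for the NB fit points to overdispersion
# that the Poisson model cannot accommodate
AIC(fit_pois, fit_nb)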
I am running a logistic model with an interaction between a dichotomous and a continuous variable:
aidslogit <- glm(cd4 ~ AGE + ANTIRET + AGE*ANTIRET, data = aidsdata, family = "binomial")
summary(aidslogit, digits=3)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.870 -1.190 0.771 1.056 1.586
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.599565 0.313241 -1.914 0.0556 .
AGE -0.008340 0.008849 -0.942 0.3459
ANTIRET 1.308591 0.198031 6.608 3.9e-11 ***
AGE:ANTIRET -0.013547 0.005507 -2.460 0.0139 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9832.8 on 7264 degrees of freedom
Residual deviance: 9434.9 on 7261 degrees of freedom
(654 observations deleted due to missingness)
AIC: 9442.9
Number of Fisher Scoring iterations: 4
What I would like to do is calculate the OR and 95% CI for antiretroviral use at different ages, i.e., at ages 20, 30, and 40. I guess I could output the covariances and do it by hand, but it seems like there must be a way to do it automatically.
For comparison:
In SAS it would look like this:
proc logistic data=aidsdata descending;
model cd4=antiret age antiret*age;
oddsratio antiret /at (age=20 30 40);
run;
For a regression without interactions, the odds ratio for each coefficient is exp(coef(aidslogit)). These are the odds ratios for a one-unit change in a given variable.
But with an interaction, you need to include both the main effect and the interaction. In this case, AGE is the second coefficient and AGE:ANTIRET is the fourth coefficient, so:
# Odds ratio for a one-unit change in AGE when ANTIRET=0
OR = exp(coef(aidslogit)[2])
# Odds ratio for a one-unit change in AGE when ANTIRET=1
OR = exp(coef(aidslogit)[2] + coef(aidslogit)[4])
To calculate the odds ratio for other AGE differences, multiply the coefficients by that amount. For example, to get the odds ratio for a 10-unit change in AGE:
# Odds ratio for a 10-unit change in AGE when ANTIRET=0
OR = exp(10 * coef(aidslogit)[2])
# Odds ratio for a 10-unit change in AGE when ANTIRET=1
OR = exp(10 * (coef(aidslogit)[2] + coef(aidslogit)[4]))
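The question asks for the OR of ANTIRET at particular ages; by the same logic, that log odds ratio is the ANTIRET coefficient plus age times the interaction coefficient, and a 95% Wald CI can be built from the coefficient covariance matrix. A sketch (or_at_age is a made-up helper name):
# OR and 95% Wald CI for ANTIRET at a given AGE:
# log-OR = b_ANTIRET + age * b_AGE:ANTIRET, with its variance from vcov()
or_at_age <- function(fit, age) {
  b <- coef(fit)
  V <- vcov(fit)
  est <- unname(b["ANTIRET"] + age * b["AGE:ANTIRET"])
  se <- sqrt(V["ANTIRET", "ANTIRET"] +
             age^2 * V["AGE:ANTIRET", "AGE:ANTIRET"] +
             2 * age * V["ANTIRET", "AGE:ANTIRET"])
  exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se))
}
sapply(c(20, 30, 40), or_at_age, fit = aidslogit)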
I'm an undergrad student and am currently struggling with R. I've been trying to teach myself for weeks but I'm not a natural, so I thought I'd seek some support.
I'm currently trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (isolate x control condition) by 2 (similarity/difference list type) study, and my dependent variable is binary recall (yes or no). I've tried to clean my data and run the code,
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8297 -0.3288 0.6444 0.6876 2.4267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4663 0.6405 2.289 0.022061 *
Conditionisolate -1.1097 0.8082 -1.373 0.169727
Listsim -4.3567 1.2107 -3.599 0.000320 ***
Conditionisolate:Listsim 5.3218 1.4231 3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output above. It completely ignores the difference and control conditions. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?
In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just use the intercept
Control/Similarity: intercept + Listsim
Isolate/Difference: intercept + Conditionisolate
Isolate/Similarity: intercept + Listsim + Conditionisolate + Conditionisolate:Listsim
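Putting that arithmetic into code (a minimal sketch; recall_fit is a made-up name for the fitted model, and plogis() converts log-odds to probabilities):
recall_fit <- glm(Target ~ Condition * List, family = "binomial", data = pro)
b <- coef(recall_fit)
logodds <- c(
  control_difference = b[["(Intercept)"]],
  control_similarity = b[["(Intercept)"]] + b[["Listsim"]],
  isolate_difference = b[["(Intercept)"]] + b[["Conditionisolate"]],
  isolate_similarity = b[["(Intercept)"]] + b[["Listsim"]] +
                       b[["Conditionisolate"]] + b[["Conditionisolate:Listsim"]]
)
plogis(logodds)  # predicted recall probability for each cell of the design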
I am using a Poisson GLM on some dummy data to predict ClaimCount based on two variables, Frequency and JudicialOrientation.
Dummy Data Frame:
data5 <-data.frame(Year=c("2006","2006","2006","2007","2007","2007","2008","2009","2010","2010","2009","2009"),
JudicialOrientation=c("Defense","Plaintiff","Plaintiff","Neutral","Defense","Plaintiff","Defense","Plaintiff","Neutral","Neutral","Plaintiff","Defense"),
Frequency=c(0.0,0.06,.07,.04,.03,.02,0,.1,.09,.08,.11,0),
ClaimCount=c(0,5,10,3,4,0,7,8,15,16,17,12),
Loss = c(100000,100,2500,100000,25000,0,7500,5200, 900,100,0,50),
Exposure=c(10,20,30,1,2,4,3,2,1,54,12,13)
)
Model GLM:
ClaimModel <- glm(ClaimCount ~ JudicialOrientation + Frequency,
                  family = poisson(link = "log"), offset = log(Exposure),
                  data = data5, na.action = na.pass)
Call:
glm(formula = ClaimCount ~ JudicialOrientation + Frequency, family = poisson(link = "log"),
data = data5, na.action = na.pass, offset = log(Exposure))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7555 -0.7277 -0.1196 2.6895 7.4768
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3493 0.2125 -1.644 0.1
JudicialOrientationNeutral -3.3343 0.5664 -5.887 3.94e-09 ***
JudicialOrientationPlaintiff -3.4512 0.6337 -5.446 5.15e-08 ***
Frequency 39.8765 6.7255 5.929 3.04e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 149.72 on 11 degrees of freedom
Residual deviance: 111.59 on 8 degrees of freedom
AIC: 159.43
Number of Fisher Scoring iterations: 6
I am using an offset of Exposure as well.
I then want to use this GLM to predict claim counts for the same observations:
data5$ExpClaimCount <- predict(ClaimModel, newdata=data5, type="response")
If I understand correctly, the Poisson GLM equation should then be:
ClaimCount = exp(-.3493 + -3.3343*JudicialOrientationNeutral +
-3.4512*JudicialOrientationPlaintiff + 39.8765*Frequency + log(Exposure))
However, I tried this manually (in Excel, =EXP(-0.3493+0+0+LOG(10)) for observation 1, for example) for some of the observations, but did not get the correct answer.
Is my understanding of the GLM equation incorrect?
You are right in your assumption about how predict() works for a Poisson GLM. This can be verified in R:
co <- coef(ClaimModel)
p1 <- with(data5,
exp(log(Exposure) + # offset
co[1] + # intercept
ifelse(as.numeric(JudicialOrientation)>1, # factor term
co[as.numeric(JudicialOrientation)], 0) +
Frequency * co[4])) # linear term
all.equal(p1, predict(ClaimModel, type="response"), check.names=FALSE)
[1] TRUE
As indicated in the comments, you probably get the wrong results in Excel because of the different base of the logarithm: Excel's LOG() defaults to base 10, while R's log() is the natural logarithm (base e). Excel's LN() is the equivalent of R's log().
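A quick way to see the discrepancy from R's side:
log(10)   # 2.302585 -- R's log() is the natural logarithm, as used in the offset
log10(10) # 1 -- the base-10 logarithm, which is what Excel's LOG(10) returns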
I have measurements obtained from 2 groups (a and b), where each group has the same 3 levels (x, y, z). The measurements are counts out of totals (i.e., rates), but in group a there cannot be zeros, whereas in group b there can (hard-coded in the example below).
Here's my example data.frame:
set.seed(3)
df <- data.frame(count = c(rpois(15,5),rpois(15,5),rpois(15,3),
rpois(15,7.5),rpois(15,2.5),rep(0,15)),
group = as.factor(c(rep("a",45),rep("b",45))),
level = as.factor(rep(c(rep("x",15),rep("y",15),rep("z",15)),2)))
#add total - fixed for all
df$total <- rep(max(df$count)*2,nrow(df))
I'm interested in quantifying, for each level x, y, z, whether there is any difference between the (average) measurements of a and b. If there is, is it statistically significant?
From what I understand, a Poisson GLM for rates seems appropriate for these types of data. In my case, it seems that a negative binomial GLM would be even more appropriate, since my data are overdispersed (I tried to create that in my example data to some extent, but in my real data it is definitely the case).
Following the answer I got to a previous post, I went with:
library(dplyr)
library(MASS)
# Collapse group "a" into a single reference level, so each b:x, b:y, b:z
# coefficient compares that cell of group b against the pooled group-a baseline
df %>%
  mutate(interactions = paste0(group, ":", level),
         interactions = ifelse(group == "a", "a", interactions)) -> df2
df2$interactions <- as.factor(df2$interactions)
fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)
> summary(fit)
Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2,
init.theta = 41.48656798, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.40686 -0.75495 -0.00009 0.46892 2.28720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02047 0.07824 -25.822 < 2e-16 ***
interactionsb:x 0.59336 0.13034 4.552 5.3e-06 ***
interactionsb:y -0.28211 0.17306 -1.630 0.103
interactionsb:z -20.68331 2433.94201 -0.008 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)
Null deviance: 218.340 on 89 degrees of freedom
Residual deviance: 74.379 on 86 degrees of freedom
AIC: 330.23
Number of Fisher Scoring iterations: 1
Theta: 41.5
Std. Err.: 64.6
2 x log-likelihood: -320.233
I'd expect the difference between a and b for level z to be significant. However, the Std. Error for level z seems enormous and hence the p-value is nearly 1.
My question is whether the model I'm using is set up correctly to answer my question (mainly regarding the use of the interactions factor).
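One thing worth checking with data like these, where an entire cell is zero by construction: a level with no events at all has no finite maximum-likelihood estimate, so its coefficient drifts toward minus infinity and its Wald standard error blows up, which is exactly the pattern in the b:z row above. A minimal sketch of that check, reusing df2 from the code above:
# Total counts per interaction level; b:z was hard-coded to all zeros,
# which is why its estimate (-20.68) and std. error (2433.9) are so extreme
with(df2, tapply(count, interactions, sum))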