Binary Logistic Regression output - R

I'm an undergrad student and am currently struggling with R. I've been trying to teach myself for weeks but I'm not a natural, so I thought I'd seek some support.
I'm currently trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (isolate vs. control condition) by 2 (similarity vs. difference list type) study, and my dependent variable is binary recall (yes or no). I've tried to clean my data and run the code:
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8297 -0.3288 0.6444 0.6876 2.4267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4663 0.6405 2.289 0.022061 *
Conditionisolate -1.1097 0.8082 -1.373 0.169727
Listsim -4.3567 1.2107 -3.599 0.000320 ***
Conditionisolate:Listsim 5.3218 1.4231 3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output above; it seems to ignore the difference and control conditions entirely. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?

In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just the intercept
Control/Similarity: intercept + Listsim
Isolate/Difference: intercept + Conditionisolate
Isolate/Similarity: intercept + Conditionisolate + Listsim + Conditionisolate:Listsim
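If you would rather not add coefficients up by hand, you can let predict() do the arithmetic for all four cells at once. A minimal sketch, assuming the data frame pro and the variables Target, Condition and List from your call (the baseline level names are whatever your factors actually use, so they are not hardcoded here):
fit <- glm(Target ~ Condition * List, family = "binomial", data = pro)
# one row per combination of the two factors
newdat <- expand.grid(Condition = levels(factor(pro$Condition)),
                      List      = levels(factor(pro$List)))
# type = "link" gives the log-odds (the sums listed above); type = "response" gives probabilities
cbind(newdat,
      logodds = predict(fit, newdata = newdat, type = "link"),
      prob    = predict(fit, newdata = newdat, type = "response"))
If you want a different baseline, change the reference level with relevel(), e.g. pro$Condition <- relevel(factor(pro$Condition), ref = "isolate"), and refit the model.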

Related

Negative binomial distribution

I am learning how to use GLMs to test hypotheses and to see how variables relate to one another.
I am trying to see whether the variable tick prevalence (parasitized individuals / assessed individuals) (dependent variable) is influenced by the number of captured hosts (independent variable).
My data look like figure 1 (116 observations).
I have read that one way to decide which distribution to use is to look at the distribution of the dependent variable, so I built a histogram of the TickPrev variable (figure 2).
I came to the conclusion that the negative binomial distribution would be the best option. Before I ran the analysis, I transformed the TickPrevalence variable (it was a proportion, and glm.nb only works with integers) using the following code:
library(dplyr)  # needed for %>% and mutate()
df <- df %>% mutate(TickPrev = TickPrev * 100)
df$TickPrev <- as.integer(df$TickPrev)
Then I applied the glm.nb function from the MASS package, and obtained this summary:
library(MASS)
summary(glm.nb(df$TickPrev ~ df$Captures, link = log))
Call:
glm.nb(formula = df15$TickPrev ~ df15$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df15$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates that there isn't enough evidence to conclude that the two variables are related. However, I am not sure whether I used the best model to fit the data, or how I can tell. Can you please help me? Also, given what I have shown, is there a better way to see whether these variables are related?
Thank you very much.

How can I calculate the variance of coefficient in linear model using R? [duplicate]

Simple question really! I am running lots of linear regressions of y~x and want to obtain the variance for each regression without computing it by hand from the Standard Error output given by the summary.lm command. Just to save a bit of time :-). Any ideas of the command to do this? Or will I have to write a function to do it myself?
m<-lm(Alopecurus.geniculatus~Year)
> summary(m)
Call:
lm(formula = Alopecurus.geniculatus ~ Year)
Residuals:
Min 1Q Median 3Q Max
-19.374 -8.667 -2.094 9.601 21.832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 700.3921 302.2936 2.317 0.0275 *
Year -0.2757 0.1530 -1.802 0.0817 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.45 on 30 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.09762, Adjusted R-squared: 0.06754
F-statistic: 3.246 on 1 and 30 DF, p-value: 0.08168
So I get a Standard Error output and I was hoping to get a Variance output without calculating it by hand...
I'm not sure what you want the variance of.
If you want the residual variance, it's: (summary(m)$sigma)**2.
If you want the variance of your slope, it's: (summary(m)$coefficients[2,2])**2, or vcov(m)[2,2].
vcov(m)
gives the covariance matrix of the coefficients – variances on the diagonal.
If you're referring to the standard errors for the coefficient estimates, the answer is
summary(m)$coef[,2]
and if you're referring to the estimated residual variance, it's
summary(m)$sigma
Type names( summary(m) ) and names(m) for other information you can access.
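Putting those accessors together, here is a small self-contained sketch; the data are made up, since the original Alopecurus.geniculatus series isn't shown:
set.seed(1)
d <- data.frame(Year = 1980:2011)
d$y <- 700 - 0.3 * d$Year + rnorm(nrow(d), sd = 11)  # toy response standing in for the cover data
m <- lm(y ~ Year, data = d)
vcov(m)                          # covariance matrix of the coefficients; variances on the diagonal
diag(vcov(m))                    # variance of the intercept and of the slope
summary(m)$coefficients[, 2]^2   # same values, obtained by squaring the standard errors
summary(m)$sigma^2               # residual variance (residual standard error squared)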

How to predict a response for a GLM using my values?

Apologies for any bad English, it is not my first language :)
So I have a dataset of the passengers of the titanic, and produced the following fit summary:
glm(formula = Survived ~ factor(Pclass) + Age + I(Age^2) + Sex +
Fare + I(Fare^2), family = binomial(), data = titan)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7298 -0.6738 -0.3769 0.6291 2.4821
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.678e+00 6.321e-01 7.401 1.35e-13 ***
factor(Pclass)2 -1.543e+00 3.525e-01 -4.377 1.20e-05 ***
factor(Pclass)3 -2.909e+00 3.882e-01 -7.494 6.69e-14 ***
Age -6.813e-02 2.196e-02 -3.102 0.00192 **
I(Age^2) 4.620e-04 3.193e-04 1.447 0.14792
Sexmale -2.595e+00 2.131e-01 -12.177 < 2e-16 ***
Fare -9.800e-03 5.925e-03 -1.654 0.09815 .
I(Fare^2) 2.798e-05 1.720e-05 1.627 0.10373
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 641.74 on 706 degrees of freedom
(177 observations deleted due to missingness)
AIC: 657.74
Number of Fisher Scoring iterations: 5
Now I'm trying to predict the survival probability of a female aged 21 who paid 35 for her ticket fare.
I'm unable to use predict or predict.glm and am unsure why. I run the following and produce this error:
predict(glmfit, data.frame(PClass=2, Sex="female", Age=20), type="response")
Error in factor(Pclass) : object 'Pclass' not found
I then tried to calculate it the long way, i.e. by multiplying my coefficients by the desired values, but the answer I get there is not right either.
(4.678e+00)+(1*-1.543e+00)+(21*-6.813e-02)+((21^2)*4.620e-04)+(35*-9.800e-03)+((35^2)*2.798e-05)
[1] 1.599287
Not sure how I could get a probability greater than 1, especially when my response is a binary factor of 0 or 1.
Could someone please shed some light on my mistakes? Thanks in advance.
If you want to calculate the probability by hand, then follow these steps:
Multiply the coefficients by the desired values and sum them (including the intercept).
Take the exponential of the output from step 1.
Probability = output of step 2 / (1 + output of step 2).
In your case, the output of step 1 is 1.599287. The output of step 2 will be exp(1.599287) = 4.949502. Then probability = 4.949502/(1 + 4.949502) = 0.8319187.
So, in R you can create your own function like
# convert log-odds (the linear predictor) into a probability
logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
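Called on the linear predictor from the question, it reproduces the hand calculation above:
logit2prob(1.599287)
# [1] 0.8319187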
For more details, you can visit this.
Otherwise, the suggestion by @Roland should work fine.
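For the predict() route, the error in the question comes from the newdata frame: it has to contain every variable used in the model formula, spelled exactly as in the fitting data (the call above passes PClass instead of Pclass and omits Fare). A sketch, assuming the fitted model object is called glmfit as in the question:
newdat <- data.frame(Pclass = 2, Sex = "female", Age = 21, Fare = 35)
predict(glmfit, newdata = newdat, type = "response")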

plm vs lm - different results?

I tried several times to use lm and plm to do the same regression, and I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Further I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm, because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure whether + factor(Region) is necessary; however, if it is not there, I don't see the coefficients (and significance) for the dummies.
So, my questions are:
Am I using the plm function wrong (or what is wrong about it)?
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate it.
Results from LM:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
data = Panel)
Residuals:
Min 1Q Median 3Q Max
-31.141 -4.856 -0.642 1.262 192.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3488 4.9134 3.531 0.000558 ***
Policychanges 0.6412 0.1215 5.277 4.77e-07 ***
factor(Region)Asia -19.3377 6.7804 -2.852 0.004989 **
factor(Region)C America + Carib 0.1147 6.8049 0.017 0.986578
factor(Region)Eurasia -17.6476 6.8294 -2.584 0.010767 *
factor(Region)Europe -20.7759 8.8993 -2.335 0.020959 *
factor(Region)Middle East -17.3348 6.8285 -2.539 0.012200 *
factor(Region)N America -17.5932 6.8064 -2.585 0.010745 *
factor(Region)Oceania -14.0440 6.8417 -2.053 0.041925 *
factor(Region)S America -14.3580 6.7781 -2.118 0.035878 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared: 0.3455, Adjusted R-squared: 0.3043
F-statistic: 8.386 on 9 and 143 DF, p-value: 5.444e-10
Results from PLM:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
model = "within", index = c("Region", "Year"))
Balanced Panel: n = 9, T = 17, N = 153
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-31.14147 -4.85551 -0.64177 1.26236 192.80277
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Policychanges 0.64118 0.12150 5.277 4.769e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 66459
Residual Sum of Squares: 55627
R-Squared: 0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07
You would need to leave out + factor(Region) in your formula for the within model with plm to get what you want.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept on your estimated model. The help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest to check if the within specification seems worthwhile.
The R squareds diverge between the lm model and the plm model. This is because the lm model with the dummies (usually called the LSDV model, for least squares dummy variables) gives what is sometimes called the overall R squared, while plm gives you the R squared of the demeaned regression, sometimes called the within R squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf
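A sketch of that workflow, assuming the same Panel data frame and variable names as in the question:
library(plm)
# within (fixed effects) model: no factor(Region) in the formula and no intercept in the output
fixed.Region2 <- plm(CapNormChange ~ Policychanges, data = Panel,
                     index = c("Region", "Year"), model = "within", effect = "individual")
within_intercept(fixed.Region2)   # the 'overall' intercept that Stata/Gretl would report
summary(fixef(fixed.Region2))     # individual (Region) effects with standard errors
# test the within specification against plain pooled OLS
pooled <- plm(CapNormChange ~ Policychanges, data = Panel,
              index = c("Region", "Year"), model = "pooling")
pFtest(fixed.Region2, pooled)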
