I am learning how to use GLMs to test hypotheses and to see how variables relate to one another.
I am trying to see whether the variable tick prevalence (parasitized individuals / assessed individuals), my dependent variable, is influenced by the number of captured hosts (independent variable).
My data look like figure 1 (116 observations).
I have read that one way to choose a distribution is to look at the distribution of the dependent variable, so I built a histogram of the TickPrev variable (figure 2).
I concluded that the negative binomial distribution would be the best option. Before running the analysis, I transformed the TickPrevalence variable (it was a proportion, and glm.nb only works with integers) using the following code:
df <- df %>% mutate(TickPrev = TickPrev * 100)  # convert the proportion to a percentage
df$TickPrev <- as.integer(df$TickPrev)          # truncate to whole numbers, since glm.nb needs integer counts
Then I applied the glm.nb function from the MASS package and obtained this summary:
summary(glm.nb(df$TickPrev ~ df$Captures, link = log))
Call:
glm.nb(formula = df15$TickPrev ~ df15$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df15$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates there isn't enough evidence to conclude that the two variables are related. However, I am not sure whether I used the best model to fit the data, or how I can tell. Can you please help me? Also, given what I have shown, is there a better way to see whether these variables are related?
Thank you very much.
Simple question really! I am running lots of linear regressions of y ~ x and want to obtain the variance for each regression without computing it by hand from the Standard Error output given by summary.lm, just to save a bit of time :-). Any ideas of the command to do this? Or will I have to write a function to do it myself?
m <- lm(Alopecurus.geniculatus ~ Year)
> summary(m)
Call:
lm(formula = Alopecurus.geniculatus ~ Year)
Residuals:
Min 1Q Median 3Q Max
-19.374 -8.667 -2.094 9.601 21.832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 700.3921 302.2936 2.317 0.0275 *
Year -0.2757 0.1530 -1.802 0.0817 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.45 on 30 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.09762, Adjusted R-squared: 0.06754
F-statistic: 3.246 on 1 and 30 DF, p-value: 0.08168
So I get a Standard Error output and I was hoping to get a Variance output without calculating it by hand...
I'm not sure what you want the variance of.
If you want the residual variance, it's: (summary(m)$sigma)**2.
If you want the variance of your slope, it's: (summary(m)$coefficients[2,2])**2, or vcov(m)[2,2].
vcov(m)
gives the covariance matrix of the coefficients – variances on the diagonal.
If you're referring to the standard errors of the coefficient estimates, the answer is
summary(m)$coef[,2]
and if you're referring to the estimated residual variance, it's
summary(m)$sigma^2
(summary(m)$sigma itself is the residual standard error, not the variance).
Type names( summary(m) ) and names(m) for other information you can access.
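If you are fitting lots of regressions, you can also wrap this up in a small helper so you don't have to pull the pieces out each time. A minimal sketch (the function name lm_variances is made up for illustration):
# Returns the residual variance and the coefficient variances of a fitted lm
lm_variances <- function(m) {
  list(residual_variance     = summary(m)$sigma^2,   # residual standard error, squared
       coefficient_variances = diag(vcov(m)))        # diagonal of the coefficient covariance matrix
}
lm_variances(m)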
I'm an undergrad student currently struggling with R; I've been trying to teach myself for weeks but I'm not a natural, so I thought I'd seek some support.
I'm trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (isolate vs control condition) by 2 (similarity vs difference list type) design, and my dependent variable is recall, which is binary (yes or no). I've tried to clean my data and run the code.
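The call (reconstructed from the Call line in the output below; pro is the cleaned data frame) was roughly:
glm(Target ~ Condition * List, family = "binomial", data = pro)
Here is the summary output: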
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8297 -0.3288 0.6444 0.6876 2.4267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4663 0.6405 2.289 0.022061 *
Conditionisolate -1.1097 0.8082 -1.373 0.169727
Listsim -4.3567 1.2107 -3.599 0.000320 ***
Conditionisolate:Listsim 5.3218 1.4231 3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output above. It seems to completely ignore the difference and control conditions. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?
In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just use the intercept
Control/Similarity: intercept + listsim
Isolate/Difference: intercept + conditionisolate
Isolate/Similarity: intercept + listsim + conditionisolate + conditionisolate:listsim
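If it helps to see those sums as probabilities, here is a minimal sketch (the baseline level names "control" and "difference" are assumed, since only the non-baseline labels appear in your output; pro and the model formula are taken from your question):
fit <- glm(Target ~ Condition * List, family = "binomial", data = pro)
# one row per Condition x List combination; level names other than "isolate" and "sim" are assumptions
cells <- expand.grid(Condition = c("control", "isolate"), List = c("difference", "sim"))
cells$p_recall <- predict(fit, newdata = cells, type = "response")  # inverse-logit of the summed coefficients
cells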
Let me first note that I haven't been able to reproduce this error on anything outside of my data set. However, here is the general idea. I have a data frame and I'm trying to build a simple logistic regression to understand the marginal effect of Amount on IsWon. Both models perform poorly (it's a single predictor, after all), but they produce two different coefficients.
First is the glm output:
> summary(mod4)
Call:
glm(formula = as.factor(IsWon) ~ Amount, family = "binomial",
data = final_data_obj_samp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2578 -1.2361 1.0993 1.1066 3.7307
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18708622416 0.03142171761 5.9540 0.000000002616 ***
Amount -0.00000315465 0.00000035466 -8.8947 < 0.00000000000000022 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6928.69 on 4999 degrees of freedom
Residual deviance: 6790.87 on 4998 degrees of freedom
AIC: 6794.87
Number of Fisher Scoring iterations: 6
Notice the negative coefficient for Amount.
And now the lrm function from rms:
Logistic Regression Model
lrm(formula = as.factor(IsWon) ~ Amount, data = final_data_obj_samp,
x = TRUE, y = TRUE)
                        Model Likelihood     Discrimination    Rank Discrim.
                           Ratio Test           Indexes           Indexes
Obs            5000     LR chi2     137.82    R2       0.036    C       0.633
 0             2441     d.f.             1    g        0.300    Dxy     0.266
 1             2559     Pr(> chi2) <0.0001    gr       1.350    gamma   0.288
max |deriv| 0.0007                            gp       0.054    tau-a   0.133
                                              Brier    0.242
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.1871 0.0314 5.95 <0.0001
Amount 0.0000 0.0000 -8.89 <0.0001
Both models do a poor job, but one estimates a positive coefficient and the other a negative one. Sure, the values are negligible, but can someone help me understand this?
For what it's worth, here's what the plot of the lrm object looks like.
> plot(Predict(mod2, fun=plogis))
The plot shows the predicted probabilities of winning have a very negative relationship with Amount.
It looks like lrm is printing the coefficient rounded to four decimal places. Since the magnitude of the estimate is well below 0.0001, it is simply displayed as 0.0000. So it looks like it might be positive or zero, but it may in fact not be (note the negative Wald Z).
You should not rely on the printed result from summary to check coefficients. The summary table is produced by print, so it is always subject to rounding. Have you tried mod4$coef (the coefficients of the glm model mod4) and mod2$coef (the coefficients of the lrm model mod2)? It is a good idea to read the "Value" section of ?glm and ?lrm.
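For example, a minimal sketch (assuming mod4 and mod2 are the fitted glm and lrm objects from your question):
coef(mod4)                                      # glm coefficients at full precision
mod2$coefficients                               # lrm coefficients at full precision
sprintf("%.10f", mod2$coefficients["Amount"])   # the Amount estimate to 10 decimal places
This should show that both estimates of Amount are essentially the same small negative number; only the printed display differs.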
I have some data which look like they follow a Gaussian distribution, so I use
my.glm <- glm(b1 ~ a1, family = gaussian)
and then the command
summary(my.glm)
The results are:
Call:
glm(formula = b1 ~ a1, family = gaussian)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.067556 -0.029598 0.002121 0.030980 0.044499
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.433697 0.018629 23.28 1.36e-12 ***
a1 -0.027146 0.001927 -14.09 1.16e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.001262014)
Null deviance: 0.268224 on 15 degrees of freedom
Residual deviance: 0.017668 on 14 degrees of freedom
AIC: -57.531
Number of Fisher Scoring iterations: 2
I think they fit well. But how can I draw a Gaussian curve over these data?
Assuming that the intercept has a normal distribution, you can plot its distribution like this:
x <- seq(0.3,0.6,by =0.001)
plot(x, dnorm(x, 0.433697, 0.018629), type = 'l')
and you might want to add your data:
rug(b1)
Since you didn't supply data, we can make some up (with some transforms to match the stats in the example):
set.seed(0)
b <- rnorm(15)
b1 <- ((b - mean(b))/sd(b) * 0.018629) + 0.433697
rug(b1)
You could also overlay a kernel density estimate of the data:
lines(density(b1), col = 'red')
Giving the following plot:
Simple: ?dnorm
Use dnorm to create a Gaussian curve of the desired mean and s.d. without tying yourself to any numerically fitted function. This is a simple, and good, way to show how your data 'fit' a theoretical curve. It's not the same thing as plotting the fitted data and trying to figure out "how close" to a Gaussian they are.
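For instance, a minimal sketch (b1 is the response vector from the question; the mean and s.d. here are simply taken from the data rather than from any fitted model):
hist(b1, freq = FALSE, main = "b1 with a Gaussian overlay")               # histogram on the density scale
curve(dnorm(x, mean = mean(b1), sd = sd(b1)), add = TRUE, col = "blue")   # theoretical Gaussian curve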