How can I change the default reference level for a factor in lm()?

Apologies for any grammar issues as English is not my first language.
I am trying to investigate data in regards to whether black people are discriminated against in comparison to white people when submitting their resumes (Bertrand and Mullainathan, 2004).
I do the following:
> resume <- read.csv("resume.csv", header=T)
> fit <- lm(call~race, data=resume)
> summary(fit)
Call:
lm(formula = call ~ race, data = resume)
Residuals:
Min 1Q Median 3Q Max
-0.09651 -0.09651 -0.06448 -0.06448 0.93552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.064476 0.005505 11.713 < 2e-16 ***
racewhite 0.032033 0.007785 4.115 3.94e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2716 on 4868 degrees of freedom
Multiple R-squared: 0.003466, Adjusted R-squared: 0.003261
F-statistic: 16.93 on 1 and 4868 DF, p-value: 3.941e-05
As you can see from the summary, it displays 'racewhite' as the variable and I have no idea how to change this so it instead displays 'raceblack'.
I know this might be quite a simple question, but thank you in advance for helping me out :)

lm is treating race as a factor. By default, the first level of the factor is used as the reference (the "control"), and the coefficients for the other levels represent differences between each level and that reference. For a factor built from character values, the "first" level is the first one when the levels are sorted alphabetically - here, "black".
You can change this by re-ordering the factor levels before the call to lm; relevel() sets the reference level directly. Or, since race has only two levels, you can simply reinterpret the coefficient by changing its sign: based on the results above, the coefficient for race with white as the reference would be -0.032033. All the other statistics - std. error, p-value, etc. - are unchanged.
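For example, assuming race is stored as a character or factor column with levels "black" and "white", a minimal sketch (not tested against your data) would be:
resume$race <- relevel(factor(resume$race), ref = "white")  # make "white" the reference level
fit <- lm(call ~ race, data = resume)
summary(fit)  # the slope is now labelled "raceblack", with its sign flipped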
It would have been helpful to see at least some of your input data.

Related

Negative binomial distribution

I am learning how to use GLMs to test hypotheses and to see how variables relate to one another.
I am trying to see if the variable tick prevalence (Parasitized individuals / Assessed individuals) (dependent variable) is influenced by the number of captured hosts (independent variable).
My data look like figure 1 (116 observations).
I have read that one way to know which distribution to use is to see which distribution the dependent variable has. So I built a histogram for the TickPrev variable (figure 2).
I came to the conclusion that the negative binomial distribution would be the best option. Before running the analysis, I transformed the TickPrev variable (it was a proportion, and glm.nb only works with integers) using the following code:
library(dplyr)  # needed for %>% and mutate()
df <- df %>% mutate(TickPrev = TickPrev * 100)
df$TickPrev <- as.integer(df$TickPrev)
Then I applied the glm.nb function from the MASS package, and obtained this summary
library(MASS)  # provides glm.nb()
summary(glm.nb(df$TickPrev ~ df$Captures, link = log))
Call:
glm.nb(formula = df15$TickPrev ~ df15$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df15$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates that there isn't enough evidence to conclude that the two variables are related. However, I am not sure whether I used the best model to fit the data, or how I can know that. Can you please help me? Also, given what I have shown, is there a better way to see if these variables are related?
Thank you very much.

How do I print the growth rate for an exponential regression model?

So, I created an exponential regression using 50 data points taken over 50 days. Finding the summary of it resulted in the following:
> summary(TotalModel)
Call:
lm(formula = log(Total) ~ Time)
Residuals:
Min 1Q Median 3Q Max
-1.0570 -0.4827 -0.1168 0.5545 0.8195
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.537196 0.165779 27.37 <2e-16 ***
Time 0.148937 0.005658 26.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5773 on 48 degrees of freedom
Multiple R-squared: 0.9352, Adjusted R-squared: 0.9339
F-statistic: 692.9 on 1 and 48 DF, p-value: < 2.2e-16
Now, while this does provide me with some of the information I needed, I want to take the growth rate of this exponential model and print it to a variable that I will later export to a spreadsheet (and repeat this about 13 more times). How do I get this value?
First of all, in your example above, the fitted model is:
log(Total) = 4.537196 + 0.148937 * Time
Exponentiating both sides, we have:
Total = exp(4.537196)*exp(0.148937 * Time)
Your growth rate is 0.148937 which is the coefficient of the Time variable. To extract this value, you could do
TotalModel$coefficients["Time"]
In general, if you're trying to see how to extract a specific value from an object, a couple of things you can try (they won't work for every object, but they're worth a try):
Look at the structure of your model by doing str(TotalModel). In your case, you will see that one of the elements is coefficients, which you can access using TotalModel$coefficients
Type TotalModel$ (with the dollar sign) and hit the tab key on your keyboard to see what possible elements can be accessed.
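To cover the export step as well, here is a minimal sketch; the file name results.csv and the one-row data frame layout are assumptions, not something from the original post:
growth_rate <- unname(TotalModel$coefficients["Time"])  # slope of log(Total) on Time

# collect one row per fitted model, then write a file any spreadsheet program can open
results <- data.frame(model = "TotalModel", growth_rate = growth_rate)
write.csv(results, "results.csv", row.names = FALSE)
Repeating this for the other models is then just a matter of rbind()-ing more rows onto results before writing the file.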
So a generic exponential function has the following form, where r is the exponential growth rate:
Total = A * (1 + r)^Time
Looks like you have estimated the transformed log-linear model, i.e.:
log(Total) = log(A) + log(1 + r) * Time
Or, put another way, the OLS model you have estimated is:
log(Total) = b0 + b1 * Time
where b0 = log(A) and b1 = log(1 + r).
So then, the exponential growth rate is equal to:
r = exp(b1) - 1
You can calculate this from the TotalModel object, where TotalModel$coefficient['Time'] is b1, and so r = exp(TotalModel$coefficient['Time']) - 1.
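As a quick numeric check using the estimate from the summary above (the value in the comment is rounded):
r <- exp(unname(TotalModel$coefficients["Time"])) - 1
r  # exp(0.148937) - 1 is roughly 0.161, i.e. about 16% growth per unit of Time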

Binary Logistic Regression output

I'm an undergrad student and am currently struggling with R. I've been trying to teach myself for weeks, but I'm not a natural, so I thought I'd seek some support.
I'm currently trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (isolate vs. control condition) by 2 (similarity vs. difference list type) study, and my dependent variable is binary recall (yes or no). I've tried to clean my data and run the code:
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8297 -0.3288 0.6444 0.6876 2.4267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4663 0.6405 2.289 0.022061 *
Conditionisolate -1.1097 0.8082 -1.373 0.169727
Listsim -4.3567 1.2107 -3.599 0.000320 ***
Conditionisolate:Listsim 5.3218 1.4231 3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output above. It completely ignores the difference and control conditions. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?
In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just use the intercept
Control/Similarity: intercept + listsim
Isolate/Difference: intercept + conditionisolate
Isolate/Similarity: intercept + listsim + conditionisolate + conditionisolate:listsim
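If it helps to see those combinations in code, here is a rough sketch that pulls the coefficients out of the fitted model and converts each cell of the 2 x 2 design to a predicted probability of recall (the object name fit is an assumption; the coefficient names come from your output):
fit <- glm(Target ~ Condition * List, family = "binomial", data = pro)
b <- coef(fit)

# predicted log-odds for each combination of condition and list type
logodds <- c(
  control_difference = unname(b["(Intercept)"]),
  control_similarity = unname(b["(Intercept)"] + b["Listsim"]),
  isolate_difference = unname(b["(Intercept)"] + b["Conditionisolate"]),
  isolate_similarity = unname(b["(Intercept)"] + b["Conditionisolate"] +
                              b["Listsim"] + b["Conditionisolate:Listsim"])
)
plogis(logodds)  # convert log-odds to predicted probabilities of recall
Equivalently, predict(fit, type = "response") on a small data frame containing the four condition/list combinations would give the same probabilities.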

ANOVA table comparing groups in R, exported to LaTeX?

I mostly work with observational data, but I read a lot of experimental hard-science papers that report results in the form of ANOVA tables, with letters indicating the significance of the differences between groups, and then p-values of the F-statistic for the joint significance of what is essentially a factor-variable regression. Here is an example that I've pulled off of a Google image search.
I think that this might be a useful way to present summary statistics about groupwise differences (or lack thereof) in an observational dataset, before I go ahead and try to control for them in various ways. I'm not sure exactly what test the letters typically represent (Tukey something?), but pairwise t-tests would suit my purposes fine.
My main question: how can I get such output from a factor-variable regression in R, and how can I seamlessly export it to LaTeX?
Here is some example data:
var = c(3500,600,12400,6400,1500,0,4400,400,900,2000,350,0,5800,0,12800,1200,350,800,2500,2000,0,3200,1100,0,0,0,0,0,1000,0,0,0,0,0,12400,6000,1700,3500,3000,1000,0,0,3500,5000,1000,3600,1600,3500,0,900,4200,0,0,0,0,1700,1250,500,950,500,600,1080,500,980,670,1200,600,550,4000,600,2800,650,0,3700,12500,0,0,0,1200,2700,0,NA,0,0,0,3700,2000,3500,0,0,0,3500,800,1400,0,500,7000,3500,0,0,0,0,2700,0,0,0,0,2000,5000,0,0,7000,0,4800,0,0,0,0,1800,0,2500,1600,4600,0,2000,5400,4500,3200,0,12200,0,3500,0,0,2800,3600,3000,0,3150,0,0,3750,2800,0,1000,1500,6000,3090,2800,600,0,0,1000,3800,3000,0,800,600,1200,0,240,1000,300,3600,0,1200,300,2700,NA,1300,1200,1400,4600,3200,750,300,750,1200,700,870,900,3200,1300,1500,1200,0,960,1800,8000,1200,NA,0,1080,1300,1080,900,700,5000,1500,3750,0,1400,900,1400,400,3900,0,1400,1600,960,1200,2600,420,3400,2500,500,4000,0,4250,570,600,4550,2000,0,0,4300,2000,0,0,0,0,NA,0,2060,2600,1600,1800,3000,900,0,0,3200,0,1500,3000,0,3700,6000,0,0,1250,1200,12800,0,1000,1100,0,950,2500,800,3000,3600,3600,1500,0,0,3600,800,0,1000,1600,1700,0,3500,3700,3000,350,700,3500,0,0,0,0,1500,0,400,0,0,0,0,0,0,0,500,0,0,0,0,5600,0,0,0)
factor = as.factor(c(5,2,5,5,5,3,4,5,5,5,3,1,1,1,5,3,6,6,6,5,5,5,3,5,3,3,3,3,4,3,3,3,4,3,5,5,3,5,3,3,3,3,5,3,3,3,3,3,5,5,5,5,5,3,3,5,3,5,5,3,5,5,4,3,5,5,5,5,5,5,4,5,3,5,4,4,3,4,3,5,3,3,5,5,5,3,5,5,4,3,3,5,5,4,3,3,5,3,3,4,3,3,3,3,5,5,3,5,5,3,3,5,4,3,3,3,4,4,5,3,1,5,5,1,5,5,5,3,3,4,5,5,5,3,3,4,5,4,5,3,5,5,5,3,3,3,3,3,3,3,3,3,3,3,4,3,3,3,3,3,3,3,4,5,4,6,4,3,5,5,3,5,3,3,4,3,5,5,5,3,5,3,3,5,5,5,3,4,3,3,3,5,3,5,3,5,5,3,5,3,5,5,5,5,5,3,5,3,5,3,4,5,5,5,6,5,5,5,5,4,5,3,5,3,3,5,4,3,5,3,4,5,3,5,3,5,3,1,5,1,5,3,5,5,5,3,6,3,5,3,5,2,5,5,5,1,5,5,6,5,4,5,4,3,3,3,5,3,3,3,3,5,3,3,3,3,3,3,5,5,5,4,4,4,5,5,3,5,4,5,5,4,3,3,3,4,3,5,5,4,3,3))
Do a simple regression on them and you get the following:
m = lm((var-mean(var,na.rm=TRUE))~factor-1)
summary(m)
Call:
lm(formula = (var - mean(var, na.rm = TRUE)) ~ factor - 1)
Residuals:
Min 1Q Median 3Q Max
-2040.5 -1240.2 -765.5 957.1 10932.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
factor1 -82.42 800.42 -0.103 0.9181
factor2 -732.42 1600.84 -0.458 0.6476
factor3 -392.17 204.97 -1.913 0.0567 .
factor4 -65.19 377.32 -0.173 0.8629
factor5 408.07 204.13 1.999 0.0465 *
factor6 303.30 855.68 0.354 0.7233
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2264 on 292 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.02677, Adjusted R-squared: 0.006774
F-statistic: 1.339 on 6 and 292 DF, p-value: 0.2397
It looks pretty clear that factors 3 and 5 are different from zero, different from each other, but that factor 3 is not different from 2 and factor 5 is not different from 6, respectively (at whatever p value).
How can I get this into ANOVA-table output like in the example above? And is there a clean way to get this into LaTeX, ideally in a form that handles a lot of variables?
The following answers only the third question.
It looks like xtable does what you'd like to do - exporting R tables to LaTeX code.
There's a nice gallery as well.
I found both in a wiki post on Stack Overflow.
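As a minimal sketch of how that might look for the model m from the question (xtable has methods for lm summaries and anova tables; the exact output options are up to you):
library(xtable)

# LaTeX code for the coefficient table of the factor regression
print(xtable(summary(m)), include.rownames = TRUE)

# LaTeX code for the ANOVA table itself
print(xtable(anova(m)))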

R-squared for observed and modeled relative to 1:1 line in R

This question may sound a little bit weird. I want to know how I can report the R-squared value in R relative to the 1:1 line. For example, I want to compare the observed and modeled values. In the ideal case, the points should fall on a straight line passing through the origin at an angle of 45 degrees.
For example, I have some data which can be found at https://www.dropbox.com/s/71u2vsgt7p9k5cl/correlationcsv
The code I wrote is as follows:
> corsen<- read.table("Sensitivity Runs/correlationcsv",sep=",",header=T)
> linsensitivity <- lm(data=corsen,sensitivity~0+observed)
> summary(linsensitivity)
Call:
lm(formula = sensitivity ~ 0 + observed, data = corsen)
Residuals:
Min 1Q Median 3Q Max
-0.37615 -0.03376 0.00515 0.04155 0.27213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
observed 0.833660 0.001849 450.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05882 on 2988 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9855
F-statistic: 2.032e+05 on 1 and 2988 DF, p-value: < 2.2e-16
The plot looks like the following:
library(ggplot2)
# 'oneline' is not defined in the post; presumably a data frame tracing the 1:1 line:
oneline <- data.frame(onelinex = c(-0.2, 2), oneliney = c(-0.2, 2))
ggplot(corsen, aes(observed, sensitivity)) + geom_point() + geom_smooth(method = "lm", aes(color = "red")) +
  ylab("Modeled (m)") + xlab("Observed (m)") +
  geom_line(data = oneline, aes(x = onelinex, y = oneliney, color = "blue")) +
  scale_color_manual("", values = c("red", "blue"), label = c("1:1 line", "Regression Line")) +
  theme_bw() + theme(legend.position = "top") + coord_cartesian(xlim = c(-0.2, 2), ylim = c(-0.2, 2))
My question is that, if we look closely, the data are off from the 1:1 line. How can I find the R-squared relative to the 1:1 line? Right now the linear model I used takes no account of the 1:1 line; it is fit purely to the data provided.
You can calculate the residuals from the 1:1 line and sum their squares:
resid2 <- with(corsen, sum((sensitivity - observed)^2))
If you wanted an R^2-like number, I suppose you could calculate:
R2like <- 1 - resid2 / with(corsen, sum(sensitivity^2))
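Putting the two snippets together with the data from the question (the file path is the one in the original post; this is an untested sketch):
corsen <- read.table("Sensitivity Runs/correlationcsv", sep = ",", header = TRUE)

# squared deviations of the modeled values from the 1:1 line (modeled == observed)
resid2 <- with(corsen, sum((sensitivity - observed)^2))

# R^2-like statistic: deviations from the 1:1 line relative to the raw sum of squares,
# mirroring how R reports R^2 for a zero-intercept lm fit
R2like <- 1 - resid2 / with(corsen, sum(sensitivity^2))
R2like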
