This question already has an answer here:
lm function in R does not give coefficients for all factor levels in categorical data
(1 answer)
Closed 6 years ago.
From this output, I know that the intercept is the prediction when both factors are 0. I understand that factor(V1)1 means V1 = 1 and factor(V2)1 means V2 = 1, and that to get the fitted value when only V1 = 1 I would add 5.1122 + (-0.4044). What I am wondering is how to interpret the p-values in this output. If V1 = 1, does that mean the p-value is 2.39e-12 + 0.376? If so, every model I run is only significant when all factors = 0...
> lm.comfortgender=lm(V13~factor(V1)+factor(V2),data=comfort.txt)
> summary(lm.comfortgender)
Call:
lm(formula = V13 ~ factor(V1) + factor(V2), data = comfort.txt)
Residuals:
Min 1Q Median 3Q Max
-3.5676 -1.0411 0.1701 1.4324 2.0590
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1122 0.5244 9.748 2.39e-12 ***
factor(V1)1 -0.4044 0.4516 -0.895 0.376
factor(V2)1 0.2332 0.5105 0.457 0.650
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.487 on 42 degrees of freedom
Multiple R-squared: 0.02793, Adjusted R-squared: -0.01836
F-statistic: 0.6033 on 2 and 42 DF, p-value: 0.5517
Each p-value in R's regression output tests the null hypothesis that the corresponding coefficient is zero. The t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of that t statistic on the residual degrees of freedom. The p-values are never added together: the 0.376 next to factor(V1)1 is, by itself, the p-value for the hypothesis that the V1 effect is zero, separately from the intercept's test. Refer to this other answer for further clarifications.
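To make that concrete, here is how the 0.376 in the output above arises; the numbers are taken from the question's own summary table (42 residual degrees of freedom):

```r
# Each p-value is a two-sided t-test of H0: coefficient = 0,
# with t = Estimate / Std. Error on the residual df.
t_val <- -0.4044 / 0.4516                            # t value for factor(V1)1
p_val <- 2 * pt(abs(t_val), df = 42, lower.tail = FALSE)
# p_val is about 0.376, matching the Pr(>|t|) column;
# it is never combined with the intercept's p-value.
```

Note that this tests each coefficient on its own; p-values for different rows are separate hypotheses, not components to be summed.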
Related
Is there a function in R that calculates the critical value of the F-statistic and compares it to the model's F-statistic to determine whether it is significant? I have to fit thousands of linear models and at the end create a dataframe with the R-squared values, p-values, F-statistic, coefficients, etc. for each linear model.
> summary(mod)
Call:
lm(formula = log2umi ~ Age + Sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.01173 -0.01173 -0.01173 -0.01152 0.98848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0115203 0.0018178 6.337 2.47e-10 ***
Age -0.0002679 0.0006053 -0.443 0.658
SexM 0.0002059 0.0024710 0.083 0.934
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1071 on 7579 degrees of freedom
Multiple R-squared: 2.644e-05, Adjusted R-squared: -0.0002374
F-statistic: 0.1002 on 2 and 7579 DF, p-value: 0.9047
I am aware of this question: How do I get R to spit out the critical value for F-statistic based on ANOVA?
But is there one function on its own that will compare the two values and spit out True or False?
EDIT:
I wrote this, but out of curiosity, if anyone knows a better way please let me know.
f_sig is a named vector that I will later add to the dataframe:
model <- lm(log2umi ~ Age + Sex, data = df)
fstat <- summary(model)$fstatistic        # c(value, numdf, dendf)
f <- fstat[1]
f_crit <- qf(1 - 0.05, fstat[2], fstat[3])
if (f > f_crit) {
  f_sig[gen] <- TRUE    # significant at the 0.05 level
} else {
  f_sig[gen] <- FALSE
}
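For what it's worth, an equivalent and slightly shorter route is to compute the model's overall p-value directly with pf() and compare it to the threshold — a sketch assuming the same `model`, `f_sig`, and `gen` as in the snippet above:

```r
# Sketch: the overall F-test p-value, computed directly from the
# F statistic and its two degrees of freedom via the upper tail of pf().
fstat <- summary(model)$fstatistic        # c(value, numdf, dendf)
p_val <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
f_sig[gen] <- p_val < 0.05                # TRUE when significant at 5%
```

This p_val reproduces the "p-value:" printed at the bottom of summary(), so comparing it to 0.05 is the same test as comparing F to its critical value.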
The output for the lm model with two categorical variables is:
Call:
lm(formula = exit_irr ~ type_exit + domicile, data = pe1)
Residuals:
Min 1Q Median 3Q Max
-0.73013 -0.17926 -0.05142 0.03945 2.85043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05333 0.22282 0.239 0.81101
type_exitTrade Sale -0.11871 0.05469 -2.171 0.03081
type_exitUnlisted -0.21208 0.07536 -2.814 0.00525
domicileKSA 0.14593 0.22852 0.639 0.52363
domicileKuwait 0.14679 0.22847 0.643 0.52108
domicileOM 0.08708 0.28225 0.309 0.75791
domicileUAE 0.18623 0.22808 0.817 0.41491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3859 on 274 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.04221, Adjusted R-squared: 0.02124
F-statistic: 2.013 on 6 and 274 DF, p-value: 0.06415
How to write equation of linear regression with categorical predictors?
The function lm() in R automatically accounts for categorical variables: it creates dummy (indicator) variables for each non-reference level of your categorical variables and regresses on those. Make sure your categorical variables are of class factor. This can be done as:
pe1$type_exit <- as.factor(pe1$type_exit)
pe1$domicile <- as.factor(pe1$domicile)
I have taken type_exit and domicile to be your categorical columns.
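You can see the dummy coding lm() uses by inspecting the design matrix with model.matrix(). Here `toy` is hypothetical data standing in for columns like pe1's:

```r
# model.matrix() exposes the 0/1 dummy columns lm() builds from factors.
# `toy` is made-up data with the same kind of columns as pe1.
toy <- data.frame(type_exit = factor(c("IPO", "Trade Sale", "Unlisted")),
                  domicile  = factor(c("Bahrain", "KSA", "UAE")))
model.matrix(~ type_exit + domicile, data = toy)
# One 0/1 column per non-reference level; the reference levels
# (here "IPO" and "Bahrain") are absorbed into the intercept,
# which is why they do not appear as rows in summary()'s output.
```

The fitted equation is then the intercept plus each coefficient times its 0/1 indicator, e.g. a Trade Sale exit adds -0.11871 relative to the reference exit type.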
Based on the answers to my earlier question, I should get the same intercept and regression coefficient from the two models below. But they are not the same. What is going on?
Is something wrong with my code, or is the original answer wrong?
#linear regression average qty per price point vs all quantities
x1=rnorm(30,20,1);y1=rep(3,30)
x2=rnorm(30,17,1.5);y2=rep(4,30)
x3=rnorm(30,12,2);y3=rep(4.5,30)
x4=rnorm(30,6,3);y4=rep(5.5,30)
x=c(x1,x2,x3,x4)
y=c(y1,y2,y3,y4)
plot(y,x)
cor(y,x)
fit=lm(x~y)
attributes(fit)
summary(fit)
xdum=c(20,17,12,6)
ydum=c(3,4,4.5,5.5)
plot(ydum,xdum)
cor(ydum,xdum)
fit1=lm(xdum~ydum)
attributes(fit1)
summary(fit1)
> summary(fit)
Call:
lm(formula = x ~ y)
Residuals:
Min 1Q Median 3Q Max
-8.3572 -1.6069 -0.1007 2.0222 6.4904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.0952 1.1570 34.65 <2e-16 ***
y -6.1932 0.2663 -23.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared: 0.8209, Adjusted R-squared: 0.8194
F-statistic: 540.8 on 1 and 118 DF, p-value: < 2.2e-16
> summary(fit1)
Call:
lm(formula = xdum ~ ydum)
Residuals:
1 2 3 4
-0.9615 1.8077 -0.3077 -0.5385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2692 3.6456 10.497 0.00895 **
ydum -5.7692 0.8391 -6.875 0.02051 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.9391
F-statistic: 47.27 on 1 and 2 DF, p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion, because rnorm only approximates the mean value you specify, particularly when you sample only 30 cases. This is easily fixed, however:
coef(fit)
#(Intercept) y
# 39.618472 -6.128739
xdum <- c(mean(x1),mean(x2),mean(x3),mean(x4))
ydum <- c(mean(y1),mean(y2),mean(y3),mean(y4))
coef(lm(xdum~ydum))
#(Intercept) ydum
# 39.618472 -6.128739
In theory the two fits are identical if (and only if) the sample means of your rnorm draws equal the exact point values used in the second model.
That is not the case in your models, so the results are slightly different. For example, the mean of x1:
> x1 = rnorm(30, 20, 1)
> mean(x1)
[1] 20.08353
where the point version is 20.
There are similar tiny differences from your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
Not that it really matters, but FYI: the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you have reversed. It makes no numerical difference, but it is worth knowing.
This question may sound a little weird. I want to know how I can report the R-squared value in R relative to the 1:1 line. For example, I want to compare observed and modeled values; in the ideal case they would fall on a straight line through the origin at an angle of 45 degrees.
For example I have a data which can be found on https://www.dropbox.com/s/71u2vsgt7p9k5cl/correlationcsv
The code I wrote is as follows:
> corsen<- read.table("Sensitivity Runs/correlationcsv",sep=",",header=T)
> linsensitivity <- lm(data=corsen,sensitivity~0+observed)
> summary(linsensitivity)
Call:
lm(formula = sensitivity ~ 0 + observed, data = corsen)
Residuals:
Min 1Q Median 3Q Max
-0.37615 -0.03376 0.00515 0.04155 0.27213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
observed 0.833660 0.001849 450.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05882 on 2988 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9855
F-statistic: 2.032e+05 on 1 and 2988 DF, p-value: < 2.2e-16
The plot looks like following:
oneline <- data.frame(onelinex = c(-0.2, 2), oneliney = c(-0.2, 2))  # endpoints of the 1:1 line
ggplot(corsen, aes(observed, sensitivity)) + geom_point() +
  geom_smooth(method = "lm", aes(color = "red")) +
  ylab("Modeled (m)") + xlab("Observed (m)") +
  geom_line(data = oneline, aes(x = onelinex, y = oneliney, color = "blue")) +
  scale_color_manual("", values = c("red", "blue"), label = c("1:1 line", "Regression Line")) +
  theme_bw() + theme(legend.position = "top") +
  coord_cartesian(xlim = c(-0.2, 2), ylim = c(-0.2, 2))
My question is that, if we look closely, the data are off from the 1:1 line. How can I find the R-squared relative to the 1:1 line? The linear model I used above takes no account of that line; it is purely fit to the data provided.
You can calculate the residuals from the 1:1 line and sum their squares:
resid2 <- with(corsen, sum((sensitivity - observed)^2))
If you wanted an R^2-like number, I suppose you could calculate:
R2like <- 1 - resid2 / with(corsen, sum(sensitivity^2))
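Putting the pieces together — a minimal sketch, assuming `corsen` has the columns `observed` and `sensitivity` as in the question. Note that the denominator is a choice: sum(sensitivity^2) matches the no-intercept convention that lm(y ~ 0 + x) uses for its reported R-squared, while the centered sum of squares gives the conventional R-squared baseline:

```r
# R^2-like measure against the 1:1 line (modeled = observed).
# `corsen` is assumed to have columns `observed` and `sensitivity`.
resid2  <- with(corsen, sum((sensitivity - observed)^2))   # SS of 1:1 residuals
ss_raw  <- with(corsen, sum(sensitivity^2))                # no-intercept baseline
ss_cent <- with(corsen, sum((sensitivity - mean(sensitivity))^2))  # centered baseline
r2_raw  <- 1 - resid2 / ss_raw
r2_cent <- 1 - resid2 / ss_cent
```

Report whichever baseline you use, since the two can differ substantially when the data are far from mean zero.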
From what I have found searching the web, below is the approach I would use to perform a polynomial regression of degree 2 on data (this is culled from the web; I don't have access at the moment to the actual commands I ran on my data, but I mimicked this):
Call:
lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE))
Residuals:
Min 1Q Median 3Q Max
-46.888 -18.834 -3.159 2.040 86.748
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5263.159 17.655 298.110 < 2e-16 ***
sample1$Year 29.318 3.696 7.933 4.64e-05 ***
I(sample1$Year^2) -10.589 1.323 -8.002 4.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.76 on 8 degrees of freedom
Multiple R-squared: 0.9407, Adjusted R-squared: 0.9259
F-statistic: 63.48 on 2 and 8 DF, p-value: 1.235e-05
My dataset is a collection of groups of data, each group having 70+ rows corresponding to monthly data measurements of several variables. I need to calculate the regression on each group of data, and find the groups with statistically significant values for the second derivative. I'd like to end up with a data set which contains one row per group_id and one column for each of the data points that make up the summary displayed above.
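One base-R way to do this — a sketch, assuming a data frame `monthly` with columns `group_id`, `Year`, and `Population` (placeholder names for your own): fit one quadratic model per group with split()/lapply(), then pull the second-degree coefficient's estimate and p-value out of each summary.

```r
# Fit a degree-2 polynomial per group and collect one row per group_id
# with the quadratic term's estimate, its p-value, and the fit's R-squared.
fits <- lapply(split(monthly, monthly$group_id), function(d) {
  lm(Population ~ poly(Year, 2, raw = TRUE), data = d)
})
results <- do.call(rbind, lapply(names(fits), function(g) {
  s  <- summary(fits[[g]])
  co <- coef(s)   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(group_id      = g,
             quad_estimate = co[3, "Estimate"],   # row 3 = Year^2 term
             quad_p        = co[3, "Pr(>|t|)"],
             r_squared     = s$r.squared)
}))
# Groups with a significant second derivative:
subset(results, quad_p < 0.05)
```

From there you can cbind any other columns of the summary (F-statistic, residual SE, the linear term) into the same per-group data frame.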