Strange abline behavior when inverting X and Y - r

I'm trying to fit a regression line with two variables, WMC and BUG.
When BUG is on the X axis, the regression line seems perfect.
However, when BUG is on the Y axis and WMC is on the X axis, the line behaves strangely; it doesn't seem to fit the plot at all. What am I doing wrong?
reg1 <- lm(WMC ~ BUG)   # WMC as the response
plot(BUG, WMC)
abline(reg1)
reg1 <- lm(BUG ~ WMC)   # BUG as the response
plot(WMC, BUG)
abline(reg1)
Yeah, I'm a stats noob.

Ignoring the rationale for your models for the moment: there is nothing wrong with how the lines are plotted. The reason the lines seem to fit the plot differently is that you are estimating two different models.
So let's start with your first model, where you regress wmc on bug:
m1 <- lm(wmc ~ bug, df); summary(m1)
Call:
lm(formula = wmc ~ bug, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.17555 -0.55069 0.00892 0.46091 2.23740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.93699 0.03057 63.37 <2e-16***
bug 0.84808 0.05926 14.31 <2e-16***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7508 on 743 degrees of freedom
Multiple R-squared: 0.2161, Adjusted R-squared: 0.215
F-statistic: 204.8 on 1 and 743 DF, p-value: < 2.2e-16
This model tells us that, when regressing wmc on bug, a one-unit increase in bug corresponds to a 0.85 increase in wmc (I don't know what the original units of your measurements are). This is reflected in the plot if you look at the intercept value and the slope of the line over a one-unit increase in bug.
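As a quick arithmetic check using the coefficients reported above, the fitted value of wmc at bug = 0 and bug = 1 differs by exactly the slope:
# fitted wmc at bug = 0 and bug = 1, from the intercept and slope above
1.93699 + 0.84808 * c(0, 1)
# [1] 1.93699 2.78507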
Now in the second model you do the opposite. You regress bug on wmc.
m2 <- lm(bug ~ wmc, df); summary(m2)
Call:
lm(formula = bug ~ wmc, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.74635 -0.26947 -0.09287 0.14058 1.67470
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.31717 0.04077 -7.779 2.44e-14***
wmc 0.25477 0.01780 14.310 < 2e-16***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4115 on 743 degrees of freedom
Multiple R-squared: 0.2161, Adjusted R-squared: 0.215
F-statistic: 204.8 on 1 and 743 DF, p-value: < 2.2e-16
So in this case a one-unit increase in wmc corresponds to a 0.25 increase in bug. This is also reflected in the plot; notice again the value of the intercept and the slope of the line with regard to a one-unit increase in wmc.
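If you want to see both fitted lines on the same axes, here is a minimal sketch (assuming a data frame df with columns wmc and bug, as in the models above); the line from the model whose response sits on the x-axis has to be re-expressed before abline can draw it:
m1 <- lm(wmc ~ bug, df)   # wmc regressed on bug
m2 <- lm(bug ~ wmc, df)   # bug regressed on wmc
plot(df$bug, df$wmc, xlab = "bug", ylab = "wmc")
abline(m1)                              # fits this plot directly
a <- coef(m2)[1]; b <- coef(m2)[2]      # m2: bug = a + b * wmc
abline(a = -a / b, b = 1 / b, lty = 2)  # m2 rewritten as wmc = -a/b + (1/b) * bug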

Related

Swap x and y variables in lm() function in R

I'm trying to get a summary output of the linear model created with the lm() function in R, but no matter which way I set it up, I get my desired y value as my x input. My desired output is the summary of the model where Winnings is the y output and averagedist is the input. This is my current output:
Call:
lm(formula = Winnings ~ averagedist, data = combineddata)
Residuals:
Min 1Q Median 3Q Max
-20.4978 -5.2992 -0.3824 6.0887 23.4764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.882e+02 7.577e-01 380.281 < 2e-16 ***
Winnings 1.293e-06 2.023e-07 6.391 8.97e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.343 on 232 degrees of freedom
Multiple R-squared: 0.1497, Adjusted R-squared: 0.146
F-statistic: 40.84 on 1 and 232 DF, p-value: 8.967e-10
I have tried flipping the order and defining the variables using y = winnings, x = averagedist, but I always get the same output.
Using summary(lm(Winnings ~ averagedist, combineddata)) as an alternative way to set it up seemed to do the trick, as opposed to the two-step method:
str <- lm(Winnings ~ averagedist, combineddata)
summary(str)
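As a sanity check, here is a small sketch with simulated data (hypothetical values, column names borrowed from the question) showing that whatever stands on the left of ~ becomes the response in the summary:
set.seed(1)
d <- data.frame(averagedist = rnorm(50, 300, 10))
d$Winnings <- 2000 * d$averagedist + rnorm(50, 0, 5000)
summary(lm(Winnings ~ averagedist, data = d))   # Winnings is the response
summary(lm(averagedist ~ Winnings, data = d))   # roles reversed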

Linear regression in R: comparing multiple observations vs single observation

Based on the answers to my question, I am supposed to get the same intercept and regression coefficient for the two models below, but they are not the same. What is going on?
Is something wrong with my code, or is the original answer wrong?
#linear regression average qty per price point vs all quantities
x1=rnorm(30,20,1);y1=rep(3,30)
x2=rnorm(30,17,1.5);y2=rep(4,30)
x3=rnorm(30,12,2);y3=rep(4.5,30)
x4=rnorm(30,6,3);y4=rep(5.5,30)
x=c(x1,x2,x3,x4)
y=c(y1,y2,y3,y4)
plot(y,x)
cor(y,x)
fit=lm(x~y)
attributes(fit)
summary(fit)
xdum=c(20,17,12,6)
ydum=c(3,4,4.5,5.5)
plot(ydum,xdum)
cor(ydum,xdum)
fit1=lm(xdum~ydum)
attributes(fit1)
summary(fit1)
> summary(fit)
Call:
lm(formula = x ~ y)
Residuals:
Min 1Q Median 3Q Max
-8.3572 -1.6069 -0.1007 2.0222 6.4904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.0952 1.1570 34.65 <2e-16 ***
y -6.1932 0.2663 -23.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared: 0.8209, Adjusted R-squared: 0.8194
F-statistic: 540.8 on 1 and 118 DF, p-value: < 2.2e-16
> summary(fit1)
Call:
lm(formula = xdum ~ ydum)
Residuals:
1 2 3 4
-0.9615 1.8077 -0.3077 -0.5385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2692 3.6456 10.497 0.00895 **
ydum -5.7692 0.8391 -6.875 0.02051 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.9391
F-statistic: 47.27 on 1 and 2 DF, p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion, because rnorm will only approximate the mean value you specify, particularly when you are sampling only 30 cases. This is easily fixed, however:
coef(fit)
#(Intercept) y
# 39.618472 -6.128739
xdum <- c(mean(x1),mean(x2),mean(x3),mean(x4))
ydum <- c(mean(y1),mean(y2),mean(y3),mean(y4))
coef(lm(xdum~ydum))
#(Intercept) ydum
# 39.618472 -6.128739
In theory they would be the same if (and only if) the sample means in the former model were equal to the fixed points used in the latter model.
This is not the case in your models, so the results are slightly different. For example, the mean of x1:
x1=rnorm(30,20,1)
mean(x1)
20.08353
where the point version is 20.
There are similar tiny differences from your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
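For a fully reproducible illustration of this point, here is a sketch with a fixed seed; because every price point has the same number of observations, regressing on the group means gives exactly the same coefficients as regressing on the full data:
set.seed(42)
x1 <- rnorm(30, 20, 1);   y1 <- rep(3, 30)
x2 <- rnorm(30, 17, 1.5); y2 <- rep(4, 30)
x3 <- rnorm(30, 12, 2);   y3 <- rep(4.5, 30)
x4 <- rnorm(30, 6, 3);    y4 <- rep(5.5, 30)
x <- c(x1, x2, x3, x4); y <- c(y1, y2, y3, y4)
xdum <- c(mean(x1), mean(x2), mean(x3), mean(x4))
ydum <- c(mean(y1), mean(y2), mean(y3), mean(y4))
coef(lm(x ~ y))         # full data
coef(lm(xdum ~ ydum))   # group means: identical coefficients with balanced groups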
Not that this really matters, but just FYI: the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you reversed. It makes no difference, of course, but just so you know.

R: test quadratic regression with interaction

I have data from an experiment with two conditions (dichotomous IV: 'condition'). I also want to make use of another IV which is metric ('hh'). My DV is also metric ('attention.hh'). I've already run a multiple regression model with an interaction of my IVs. Therefore, I centered the metric IV by doing this:
hh.cen <- as.numeric(scale(data$hh, scale = FALSE))
With these variables I ran the following analysis:
model.hh <- lm(attention.hh ~ hh.cen * condition, data = data)
summary(model.hh)
The results are as follows:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04309 3.83335 0.011 0.991
hh.cen 4.97842 7.80610 0.638 0.525
condition 4.70662 5.63801 0.835 0.406
hh.cen:condition -13.83022 11.06636 -1.250 0.215
However, the theory behind my analysis tells me that I should expect a quadratic relation between my metric IV (hh) and the DV (but only in one condition).
Looking at the plot, one could at least infer this relation.
Of course I want to test this statistically. However, I'm now struggling with how to specify the linear regression model.
I have two solutions that I think should be valid, but they lead to different outcomes. Unfortunately, I don't know which is the right one. I know that by including interactions (and 3-way interactions) in the model, I also have to include all simple/main effects.
Solution 1: including all terms on their own
Therefore I first center the DV and compute the squared IV:
attention.hh.cen <- scale(data$attention.hh, scale = FALSE)
hh.sqr <- hh.cen^2
Now I can compute the linear model:
sqr.model.1 <- lm(attention.hh.cen ~ condition + hh.cen + hh.sqr + (condition:hh.cen) + (condition:hh.sqr), data = data)
summary(sqr.model.1)
This leads to the following outcome:
Call:
lm(formula = attention.hh.cen ~ condition + hh.cen + hh.sqr +
(condition:hh.cen) + (condition:hh.sqr), data = data)
Residuals:
Min 1Q Median 3Q Max
-53.798 -14.527 2.912 13.111 49.119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3475 3.5312 -0.382 0.7037
condition -9.2184 5.6590 -1.629 0.1069
hh.cen 4.0816 6.0200 0.678 0.4996
hh.sqr 5.0555 8.1614 0.619 0.5372
condition:hh.cen -0.3563 8.6864 -0.041 0.9674
condition:hh.sqr 33.5489 13.6448 2.459 0.0159 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.77 on 87 degrees of freedom
Multiple R-squared: 0.1335, Adjusted R-squared: 0.08365
F-statistic: 2.68 on 5 and 87 DF, p-value: 0.02664
Solution 2: R includes all main effects of an interaction by using the *
sqr.model.2 <- lm(attention.hh.cen ~ condition * I(hh.cen^2), data = data)
summary(sqr.model.2)
IMHO, this should also be fine; however, the output is not the same as the one produced by the code above:
Call:
lm(formula = attention.hh.cen ~ condition * I(hh.cen^2), data = data)
Residuals:
Min 1Q Median 3Q Max
-52.297 -13.353 2.508 12.504 49.740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.300 3.507 -0.371 0.7117
condition -8.672 5.532 -1.567 0.1206
I(hh.cen^2) 4.490 8.064 0.557 0.5791
condition:I(hh.cen^2) 32.315 13.190 2.450 0.0162 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.64 on 89 degrees of freedom
Multiple R-squared: 0.1254, Adjusted R-squared: 0.09587
F-statistic: 4.252 on 3 and 89 DF, p-value: 0.007431
I'd rather go with solution number 1, but I'm not sure about that.
Maybe someone has a better solution or can help me out?
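A side note on the formula syntax, using the variable names from the question: the * in solution 2 only expands condition and I(hh.cen^2), so the linear term hh.cen and its interaction are dropped, which is why the two outputs differ. To reproduce solution 1 with the * shorthand, one would write something like:
# expands to condition + hh.cen + I(hh.cen^2) + condition:hh.cen + condition:I(hh.cen^2),
# i.e. the same terms as sqr.model.1 (hh.sqr there is hh.cen^2)
sqr.model.3 <- lm(attention.hh.cen ~ condition * (hh.cen + I(hh.cen^2)), data = data)
summary(sqr.model.3)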

Interpreting Summary Statistics with Categorical Variables [duplicate]

This question already has an answer here:
lm function in R does not give coefficients for all factor levels in categorical data
(1 answer)
Closed 6 years ago.
With this output, I know that the intercept is the case where both factors are 0. I understand that factor(V1)1 means V1 = 1 and factor(V2)1 means V2 = 1. To get the prediction when just V1 = 1, I would add 5.1122 + (-0.4044). However, I am wondering how to interpret the p-values in this output. If just V1 = 1, does that mean the p-value is 2.39e-12 + 0.376? If so, every model I run is only significant when all factors = 0...
> lm.comfortgender=lm(V13~factor(V1)+factor(V2),data=comfort.txt)
> summary(lm.comfortgender)
Call:
lm(formula = V13 ~ factor(V1) + factor(V2), data = comfort.txt)
Residuals:
Min 1Q Median 3Q Max
-3.5676 -1.0411 0.1701 1.4324 2.0590
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1122 0.5244 9.748 2.39e-12 ***
factor(V1)1 -0.4044 0.4516 -0.895 0.376
factor(V2)1 0.2332 0.5105 0.457 0.650
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.487 on 42 degrees of freedom
Multiple R-squared: 0.02793, Adjusted R-squared: -0.01836
F-statistic: 0.6033 on 2 and 42 DF, p-value: 0.5517
The p-values given in the R regression output test, for each coefficient, the null hypothesis that the true value of that coefficient is zero; the test statistic is the estimate divided by its standard error, which is compared against a t distribution with the residual degrees of freedom. Refer to this other answer for further clarification.
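For instance, the p-value reported for factor(V1)1 can be reproduced from its estimate and standard error (a quick check using the numbers in the output above; 42 is the residual degrees of freedom):
t_val <- -0.4044 / 0.4516        # t value = estimate / standard error
2 * pt(-abs(t_val), df = 42)     # two-sided p-value, roughly 0.376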

R-squared for observed and modeled relative to 1:1 line in R

This question may sound a little weird. I want to know how I can report the R-squared value in R relative to the 1:1 line. For example, I want to compare observed and modeled values. In the ideal case, they should fall on a straight line passing through the origin at an angle of 45 degrees.
For example, I have data which can be found at https://www.dropbox.com/s/71u2vsgt7p9k5cl/correlationcsv
The code I wrote is as follows:
> corsen<- read.table("Sensitivity Runs/correlationcsv",sep=",",header=T)
> linsensitivity <- lm(data=corsen,sensitivity~0+observed)
> summary(linsensitivity)
Call:
lm(formula = sensitivity ~ 0 + observed, data = corsen)
Residuals:
Min 1Q Median 3Q Max
-0.37615 -0.03376 0.00515 0.04155 0.27213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
observed 0.833660 0.001849 450.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05882 on 2988 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9855
F-statistic: 2.032e+05 on 1 and 2988 DF, p-value: < 2.2e-16
The plot was produced with the following code:
ggplot(corsen,aes(observed,sensitivity))+geom_point()+geom_smooth(method="lm",aes(color="red"))+
ylab(" Modeled (m)")+xlab("Observed (m)")+
geom_line(data=oneline,aes(x=onelinex,y=oneliney,color="blue"))+
scale_color_manual("",values=c("red","blue"),label=c("1:1 line","Regression Line"))+theme_bw()+theme(legend.position="top")+
coord_cartesian(xlim=c(-0.2,2),ylim=c(-0.2,2))
My question is that, if we look closely, the data are off from the 1:1 line. How can I find the R-squared relative to the 1:1 line? Right now the linear model I used takes no account of the line I specified; it is purely based on the data provided.
You can calculate the residuals and sum their squares:
resid2 <- with(corsen, sum((sensitivity - observed)^2))
If you wanted an R^2-like number, I suppose you could calculate:
R2like <- 1 - resid2 / with(corsen, sum(sensitivity^2))
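Putting the two steps together, a minimal sketch (assuming a data frame with observed and sensitivity columns, as in the question):
# R^2-like statistic relative to the 1:1 line, as defined above
r2_one_to_one <- function(df) {
  ss_res <- with(df, sum((sensitivity - observed)^2))  # squared deviations from the 1:1 line
  1 - ss_res / with(df, sum(sensitivity^2))
}
# r2_one_to_one(corsen)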
