Linear regression different using R plot() and qplot()

If I create a scatterplot using plot() and add the fit from lm(x~y) on my data, I get an intercept at about 500. When I look at the qplot of the same data with stat_smooth(method=lm), the intercept is at roughly 1000 on the y axis, although the slope looks visually similar to the one in the simple plot(). I hope this makes sense. I cannot understand why there is a difference. The full calls are given below. Any help will be greatly appreciated.
plot():
plot (my[[12]],my[[8]])
abline(lm(my[[12]]~my[[8]]),col="red")
qplot():
myGG<-qplot(x=my[[12]],y=my[[8]]) # pretty scatterplot
myGG<-myGG + stat_smooth(fullrange=TRUE,method="lm")

It seems to me that the variables in the two regressions do not correspond. In lm the variable my[[12]] is the dependent one; in the qplot variant it is the independent one. Using lm(my[[8]]~my[[12]]) should make them equivalent.
It is a common mistake to mix up the variables when using plot and lm. Note that to get the axes right, the order of the variables in lm is the reverse of the order in plot.
x <- rnorm(100)
y <- rnorm(100)
plot(x,y)
abline(lm(y ~x))
To make it less confusing you might use the formula interface in plot as well.
plot(y ~ x)
abline(lm(y ~x))
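To see how much the order of the variables matters, here is a small sketch with simulated data (the means, slope and noise level are made up for illustration and are not the asker's data). It fits both directions and compares the coefficients: the two fits describe different lines, so their intercepts need not be close.

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
y <- 1000 + 5 * x + rnorm(100, sd = 40)   # assumed "true" relationship

fit_yx <- lm(y ~ x)   # y is the response: this is the line for plot(x, y)
fit_xy <- lm(x ~ y)   # x is the response: a different regression entirely

coef(fit_yx)
coef(fit_xy)

# Only the first fit belongs on a plot with x on the horizontal axis
plot(x, y)
abline(fit_yx, col = "red")
```

The product of the two slopes equals the squared correlation, so one fit is the exact inverse of the other only when the points lie perfectly on a line.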

Related

When using GAM in R, why is the plot for two continuous variables different from the first one plotted with two continuous by categorical?

I'm using GAM in R and I can't understand why the output of two different equations that should give the same plot is not exactly the same.
For example, when using the mpg dataset with a multivariate equation as follows, I get the plot for the additive effect of weight and rpm on hw.mpg. Then, I want to see what happens when I plot the data of rpm by fuel type. This gives me 3 plots, and I expected the first one (weight) to be exactly the same as the one plotted previously without the "by fuel" differentiation. Am I missing something? Then what is graph 1 in figure 2 showing?
To get figure 1:
par(mfrow=c(1,2))
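library(mgcv) # provides gam()
data(mpg, package = "gamair") # assumption: this is the gamair mpg data (hw.mpg, weight, rpm, fuel)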
mod_hwy1 <- gam(hw.mpg ~ s(weight) + s(rpm), data = mpg, method = "REML")
plot(mod_hwy1)
To get figure 2:
par(mfrow=c(1,3))
mod_hwy2 <- gam(hw.mpg ~ s(weight) + s(rpm, by=fuel), data = mpg, method = "REML")
plot(mod_hwy2)
With my own data it is even more visible that the two graphs are not exactly the same:
Please someone help me understand!
The main problem with your model is that you forgot to include the group means for the levels of fuel. As a result, the smooths, which are centred about the overall mean of the response, also have to model the group means for the levels of fuel.
Fit the model as:
mod_hwy2 <- gam(hw.mpg ~ fuel + # <--- group means
s(weight) + s(rpm, by=fuel),
data = mpg, method = "REML")
Then add in Gregor's point about these effects being conditional upon the other terms in the model and you should be able to understand what's going on and why things change.
And regarding one of your comments; the locations are shown in your plot, look at the label for the y-axis of each plot.
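The effect of leaving out the factor term can be sketched on simulated data. Everything below (the variable names x, g, y, the group offset of 3, the noise level) is made up for illustration; it only assumes the mgcv package. The model with g as a parametric term can absorb the difference in group means, while the model without it cannot.

```r
library(mgcv)

set.seed(1)
n <- 300
g <- factor(sample(c("a", "b"), n, replace = TRUE))
x <- runif(n)
# Same smooth shape in both groups, but group "b" sits 3 units higher
y <- ifelse(g == "a", 0, 3) + sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(y, x, g)

# Without the factor main effect: only one intercept for both groups
m_no  <- gam(y ~ s(x, by = g), data = dat, method = "REML")
# With the factor main effect: one mean per group plus one smooth per group
m_yes <- gam(y ~ g + s(x, by = g), data = dat, method = "REML")

deviance(m_no)
deviance(m_yes)

par(mfrow = c(1, 2))
plot(m_yes)
```

Comparing the two deviances (or the plotted smooths) gives a quick check of whether the group means were being pushed into the smooth terms.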

R: Best fit for data (Exponential or Power), with curve predicted beyond final data point

So, I am challenged and request a little guidance.
I have used the rriskDistributions package to evaluate some CDFs for some industrial sector injury data with the get.lnorm.par() function. It fits the data well. Unfortunately, the axes need swapping, because my response variable is currently on the x-axis and needs to be on the y-axis. The get.lnorm.par() function also requires the probabilities to be on the y-axis, and I cannot figure out how to create the same curve with swapped axes.
I want to get it to look something like this:
An example of the code that I have worked through in ggplot follows:
x <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
y <- c(3170221,6810103,14999840,26623982,48903587,74177290,266181110)
prob <- c(x) ## There are 389 different x values, but keeping it simple!
quant <- c(y) ## Same as x.
df1 <- data.frame(prob,quant)
plot2 <- ggplot(df1, aes(x=prob, y=quant)) + geom_point() +
geom_smooth(method="lm", formula= log(y)~x, se=FALSE) +
labs(y="quantiles", x="probabilities", title="Probs vs Quants")
plot2
I have created lines that fit this data, but everything ends at the last data point.
When I used get.lnorm.par(), the fit was great but, as stated previously, the axes need flipping. When I tried this, I kept getting errors about infinite output and could not define the bounds of the function to be plotted.
So, here is the code using the rriskDistributions package:
pct <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
my.lnorm<-get.lnorm.par(p=pct, q=c(3170221,6810103,14999840,26623982,48903587,74177290,266181110),
tol = 0.001, scaleX = c(0,0.0809))
Essentially, I am trying to create a fit curve for the data (either exponential or power) that expands, or predicts beyond the final data point. This I cannot figure out for the life of me, and changing any of the parameters in the rriskDistributions functions is quite challenging.
Any thoughts?
Thanks.
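One possible way to draw a fitted curve past the last observation is sketched below. It is not from the thread and assumes that an exponential form (log(quant) linear in prob) is adequate for the seven points shown; the upper limit of 0.4 for the prediction grid is an arbitrary choice. The idea is to fit with lm() and then call predict() on a grid that is wider than the data.

```r
prob  <- c(0.0416988, 0.0656371, 0.1015444, 0.1270270, 0.1536680, 0.1694981, 0.2509653)
quant <- c(3170221, 6810103, 14999840, 26623982, 48903587, 74177290, 266181110)
df1 <- data.frame(prob, quant)

# Exponential model: log(quant) is linear in prob
fit <- lm(log(quant) ~ prob, data = df1)

# Prediction grid that extends beyond the largest observed probability
grid <- data.frame(prob = seq(min(df1$prob), 0.4, length.out = 200))
grid$quant <- exp(predict(fit, newdata = grid))   # back-transform to the original scale

plot(quant ~ prob, data = df1, log = "y",
     xlim = range(grid$prob), ylim = range(grid$quant),
     xlab = "probabilities", ylab = "quantiles")
lines(quant ~ prob, data = grid, col = "red")
```

Anything to the right of the last data point is extrapolation, so the curve there is only as trustworthy as the assumed model.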

R: Lattice Q-Q Plot with regression Line

I can create a lattice qq-plot with:
qqnorm(surfchem.cast$Con)
but I have not learned how to add a panel.abline or prepanel.qqmathline().
I've looked in the lattice graphics book and searched the web without finding the correct syntax. A pointer to how to add this line representing the linear relationship between theoretical and data quantiles will be greatly appreciated. I also do not find a question here where the answer is for a qq plot rather than an xyplot.
The convention with Q-Q plots is to plot the line that goes through the first and third quartiles of the sample and the test distribution, not the line of best fit.
set.seed(1)
Z <- rnorm(100)
qqnorm(Z)
qqline(Z,probs=c(0.25,0.75))
The reason for this is that, if your sample is not normally distributed, the deviations tend to be at the extremes.
set.seed(1)
Z <- runif(100) # NOTE: uniform distribution...
qqnorm(Z)
qqline(Z, probs=c(0.25,0.75))
If you want the line connecting the corners, as in your comment, use different probabilities. The reason you need to use (0.01,0.99) rather than (0,1) is that the latter will produce infinities.
set.seed(1)
Z <- runif(100) # NOTE: uniform distribution...
qqnorm(Z)
qqline(Z, probs=c(0.01,0.99))
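The examples above use base graphics. For a lattice version, a minimal sketch (assuming the lattice package and the same simulated Z, not the asker's surfchem.cast data) is to use qqmath() with a custom panel function that calls panel.qqmathline(), which by default draws the line through the quartiles:

```r
library(lattice)

set.seed(1)
Z <- rnorm(100)

p <- qqmath(~ Z,
            prepanel = prepanel.qqmathline,   # sets axis limits with the line in mind
            panel = function(x, ...) {
              panel.qqmathline(x, ...)        # reference line through the quartiles
              panel.qqmath(x, ...)            # the Q-Q points themselves
            })
print(p)
```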

Add line/equation to scatter plot

I have 3 models, all of which are significant and I want to create a linear graph with my data. This is what I have so far:
morpho<-read.table("C:\\Users\\Jess\\Dropbox\\Monochamus\\Morphometrics.csv",header=T,sep=",")
attach(morpho)
wtpro<-lm(weight~pronotum)
plot(weight,pronotum)
abline(wtpro)
I have tried entering the abline as:
abline(lm(weight~pronotum))
I can't figure out what I'm doing wrong. I want to add my equation, I have all of my coefficients but can't get past the line...I have even started over thinking maybe I messed up along the way and it still will not work. Is there a separate package that I am missing?
Try:
abline(coef(lm(weight~pronotum))) # works if the data frame is attached.
I try to avoid attach(). It creates all sorts of anomalies that increase as you do more regression work. Better would be:
wtpro<-lm(weight~pronotum, data= morpho)
with( morpho , plot(pronotum, weight) ) # predictor on x, response on y, matching lm(weight~pronotum)
abline( coef(wtpro) )
Plot is in the format plot(x, y, ...) and it looks like you've ordered your dependent variable first. Easy mistake to make.
For example:
Set up some data
y <- rnorm(10)
x <- rnorm(10) + 5
A plot with the dependent variable placed on the x axis will not display the regression line, as the line falls outside the visible plot region.
plot(y,x, main='Check the axis labels')
abline(lm(y~x), col='red')
Flip the variables in the plot command. Now it will be visible.
plot(x,y, main='Check the axis labels')
abline(lm(y~x), col='red')
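The question also asks about adding the equation to the plot. A minimal sketch using only base graphics follows; the simulated data and the label format are illustrative assumptions, not the asker's morphometrics data.

```r
set.seed(1)
x <- rnorm(10) + 5
y <- 2 + 0.5 * x + rnorm(10, sd = 0.2)   # made-up relationship for illustration

fit <- lm(y ~ x)
plot(x, y)                 # predictor on x, response on y
abline(fit, col = "red")

# Build the equation text from the fitted coefficients and place it in a corner
cf <- coef(fit)
eq <- sprintf("y = %.2f + %.2f x", cf[1], cf[2])
legend("topleft", legend = eq, bty = "n")
```

Using legend() rather than text() avoids having to pick coordinates for the label by hand.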

ggplot2 2d Density Weights

I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_point and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (OrgSize variable). However, when I add OrgSize as a weighting variable, nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately, should I instead be using contour and supplying OrgSize as the 'height'? If so then why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing it to MASS::kde2d, since that function has no weights argument. As a consequence, you will need to use a different 2d-density method.
(I realize my answer is not addressing why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D-KDEs, I have needed to use other packages besides MASS. Maybe this is a TODO that @hadley put in the help page that then got overlooked?)
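To illustrate what a weighted 2D density would look like, here is a minimal base R sketch. The function weighted_kde2d() is a hypothetical helper written for this example; it is not part of ggplot2 or MASS, it uses a plain Gaussian product kernel with bw.nrd0() bandwidths (an assumption, and not the same bandwidth convention as MASS::kde2d), and it reuses the data from the question.

```r
# Hypothetical helper: weighted 2D Gaussian kernel density on an n x n grid
weighted_kde2d <- function(x, y, w, n = 50,
                           h = c(bw.nrd0(x), bw.nrd0(y))) {
  w  <- w / sum(w)                                  # normalise weights to sum to 1
  gx <- seq(min(x), max(x), length.out = n)
  gy <- seq(min(y), max(y), length.out = n)
  kx <- dnorm(outer(gx, x, "-") / h[1]) / h[1]      # n x length(x) kernel values
  ky <- dnorm(outer(gy, y, "-") / h[2]) / h[2]      # n x length(y) kernel values
  # z[a, b] = sum over points i of w[i] * kx[a, i] * ky[b, i]
  z <- kx %*% (w * t(ky))
  list(x = gx, y = gy, z = z)
}

Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate     <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize  <- c(500,1000,200,300,1500,800,50,1000,75,800)

d_u <- weighted_kde2d(Distance, Rate, rep(1, length(Distance)))  # equal weights
d_w <- weighted_kde2d(Distance, Rate, OrgSize)                   # weighted by OrgSize

contour(d_u, xlab = "Distance", ylab = "Rate")
contour(d_w, add = TRUE, col = "red")   # weighted contours overlaid in red
```

The two sets of contours differ, which is the behaviour the question expected from the weight aesthetic.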
