Least squares regression line not matching scatterplot - r

I am trying to figure out residual distances for a length-mass association, and am running into an issue where the predicted values line does not match the points on the scatterplot at all, although I believe I am using the correct code. I've attached a picture of the plot I'm getting... any ideas as to what's going wrong?
logTL<-log10(bd.1$TL)
logMass<-log10(bd.1$mass)
#linear relation bt log TL and log mass
lma<-lm(logMass~logTL)
summary(lma)
#plot and fit line to data
plot(logMass, logTL,xlab="log (base 10) total length", ylab="log (base 10) mass")
abline(coefficients(lma))

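You should swap the order of logTL and logMass in your plot call, i.e.,
plot(logTL, logMass, xlab="log (base 10) total length", ylab="log (base 10) mass")
since you regressed logMass on logTL, i.e., lma<-lm(logMass~logTL), so the fitted line only makes sense with logTL on the x-axis.
Otherwise, if you keep logMass on the x-axis, you have to invert the fitted line: logMass = a + b*logTL becomes logTL = -a/b + (1/b)*logMass, i.e., abline(a = -coef(lma)[1]/coef(lma)[2], b = 1/coef(lma)[2]). Simply taking 1/coefficients(lma) gives the right slope but the wrong intercept. In that case the xlab and ylab in your original call need to be swapped as well.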
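To make the two options concrete, here is a minimal sketch. The data are simulated (bd.1 is not available here), so the numbers are purely illustrative; only the variable names logTL, logMass and lma come from the question.

```r
# Simulated stand-in for the question's data (assumption: bd.1 is not available)
set.seed(1)
logTL   <- runif(50, 1.5, 2.5)                     # log10 total length
logMass <- -5 + 3 * logTL + rnorm(50, sd = 0.05)   # log10 mass

lma <- lm(logMass ~ logTL)
a <- unname(coef(lma)[1])   # intercept
b <- unname(coef(lma)[2])   # slope

# Option 1: predictor on the x-axis, response on the y-axis
plot(logTL, logMass,
     xlab = "log (base 10) total length", ylab = "log (base 10) mass")
abline(lma)

# Option 2: axes swapped, so the fitted line has to be inverted
plot(logMass, logTL,
     xlab = "log (base 10) mass", ylab = "log (base 10) total length")
abline(a = -a / b, b = 1 / b)

# Residual distances (vertical distances in the Option 1 plot)
res <- residuals(lma)
```

Either plot shows the same fitted relationship; the residuals are measured along the logMass axis in both cases.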

Related

kernel density estimator in R

I'm using the last column from the following data (the galaxy.dat file read in the code below),
and I'm trying to apply the idea of a kernel density estimator to this dataset, which is represented by
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
where K is some kernel (usually, though not necessarily, a normal density), h is the bandwidth, n is the length of the data set, X_i is each data point and x is the point at which the density is evaluated. Using this equation I have the following code,
AstroData=read.table(paste0("http://www.stat.cmu.edu/%7Elarry",
"/all-of-nonpar/=data/galaxy.dat"),
header=FALSE)
x=AstroData$V3
xsorted=sort(x)
x_i=xsorted[1:1266]
hist(x_i, nclass=308)
n=length(x_i)
h1=.002
t=seq(min(x_i),max(x_i),0.01)
M=length(t)
fhat1=rep(0,M)
for (i in 1:M){
fhat1[i]=sum(dnorm((t[i]-x_i)/h1))/(n*h1)}
lines(t, fhat1, lwd=2, col="red")
Which produces the following plot,
which is actually close to what I want, as the final result should look like this once I remove the histogram,
which, as you can see, is more finely resolved, whereas my red line, which should represent the density, is rather rough and does not reach as high. The final plot that you see was produced using the density function in R,
plot(density(x=x_i, bw=.002))
Which is what I want to get to without having to use any additional packages.
Thank you
After some talk with my roommate, he gave me the idea to decrease the spacing of the t-values (the x grid). In doing so I changed it from 0.01 to 0.001. So the final code for this plot is as follows,
AstroData=read.table(paste0("http://www.stat.cmu.edu/%7Elarry",
"/all-of-nonpar/=data/galaxy.dat"),
header=FALSE)
x=AstroData$V3
xsorted=sort(x)
x_i=xsorted[1:1266]
hist(x_i, nclass=308)
n=length(x_i)
h1=.002
t=seq(min(x_i),max(x_i),0.001)
M=length(t)
fhat1=rep(0,M)
for (i in 1:M){
fhat1[i]=sum(dnorm((t[i]-x_i)/h1))/(n*h1)}
lines(t, fhat1, lwd=2, col="blue")
Which in turn gives the following plot, which is the one that I wanted,
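The fix works because the grid spacing has to be small relative to the bandwidth: with h1 = .002, a step of 0.01 is five bandwidths wide, so most of the narrow kernel peaks fall between grid points. Below is a minimal sketch of the same estimator without the explicit loop, checked against density(). It uses simulated data rather than the galaxy file, and the names x_sim and kde_gauss are made up for illustration.

```r
# Simulated stand-in for the galaxy column (assumption: no network access)
set.seed(1)
x_sim <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 0.5))
h <- 0.2   # bandwidth

# Gaussian kernel density estimate at each grid point:
# fhat(g) = (1 / (n * h)) * sum_i dnorm((g - X_i) / h)
kde_gauss <- function(grid, data, h) {
  sapply(grid, function(g) mean(dnorm((g - data) / h)) / h)
}

# Grid step chosen as a fraction of the bandwidth
step <- h / 10
grid <- seq(min(x_sim) - 4 * h, max(x_sim) + 4 * h, by = step)
fhat <- kde_gauss(grid, x_sim, h)

# Sanity check 1: the estimate should integrate to roughly 1
area <- sum(fhat) * step

# Sanity check 2: compare with density(), which uses a binned approximation
d   <- density(x_sim, bw = h, kernel = "gaussian")
ref <- approx(d$x, d$y, xout = grid, rule = 2)$y
max_diff <- max(abs(fhat - ref))

hist(x_sim, breaks = 40, freq = FALSE, main = "Manual KDE vs density()")
lines(grid, fhat, col = "blue", lwd = 2)
lines(d, col = "red", lty = 2)
```

The two curves should lie almost on top of each other; small differences are expected because density() bins the data before smoothing.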

Problems plotting exponential formula

I am trying to make an exponential plot of a variable. The coefficient of the variable from the GLM results is very high (350 million). With other variables that had lower coefficients, I was able to plot them with no issues. I have been trying to set the sequence interval smaller and smaller, but it keeps crashing R when I try to plot it.
Any suggestions? I have tried breaking up the data already with no luck.
My vectors are very large numerics as well (18Mb).
chlautcnod<-seq.int(0, 2.45259, 0.000001)
chlautcnodline<- glmnodosaALL$coefficients[1] +
glmnodosaALL$coefficients[2]*mean(bornodosaAP$Chl_spring) +
glmnodosaALL$coefficients[3]*chlautcnod + glmnodosaALL$coefficients[4]*mean(bornodosaAP$Dist_coast) +
glmnodosaALL$coefficients[5]*mean(bornodosaAP$Chl_winter)+ glmnodosaALL$coefficients[6]*mean(bornodosaAP$Depth) +
glmnodosaALL$coefficients[7]*mean(bornodosaAP$Chl_yr_avg)+ glmnodosaALL$coefficients[8]*mean(bornodosaAP$Dist_complete_river) +
glmnodosaALL$coefficients[9]*mean(bornodosaAP$Temp_yr_min)+ glmnodosaALL$coefficients[10]*mean(bornodosaAP$Chl_summer)+
glmnodosaALL$coefficients[11]*mean(bornodosaAP$Chl_yr_max)+ glmnodosaALL$coefficients[12]*mean(bornodosaAP$SWH_summer)+
glmnodosaALL$coefficients[13]*mean(bornodosaAP$SWH_yr_min)+ glmnodosaALL$coefficients[14]*mean(bornodosaAP$SWH_spring)
plot(exp(chlautcnodline)~chlautcnod, xlab = expression(paste("Chlorophyll-α Autumn (mg/m"^"3"~")")), ylab= "Probability of C. nodosa occurrence",ylim=c(0,0.05),xlim=c(0.15,0.17), type="l", bty="l")
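One thing that may help: the sequence above has roughly 2.45 million points, but the plot only shows the window from 0.15 to 0.17, so a few hundred points inside that window should be enough. The sketch below illustrates this with predict() on a made-up binomial GLM. The data, the model and the names dat, fit and newdat are all assumptions for illustration, not the glmnodosaALL model. Note that type = "response" applies the model's own inverse link, which is exp() only for a log link.

```r
# Hypothetical data and model, standing in for bornodosaAP / glmnodosaALL
set.seed(42)
n <- 300
dat <- data.frame(chl_autumn = runif(n, 0, 2.5),
                  depth      = runif(n, 5, 50))
dat$present <- rbinom(n, 1, plogis(-1 + 1.5 * dat$chl_autumn - 0.03 * dat$depth))

fit <- glm(present ~ chl_autumn + depth, family = binomial, data = dat)

# Evaluate only over the plotted window, holding the other predictor at its mean
newdat <- data.frame(chl_autumn = seq(0.15, 0.17, length.out = 200),
                     depth      = mean(dat$depth))
newdat$prob <- predict(fit, newdata = newdat, type = "response")

plot(prob ~ chl_autumn, data = newdat, type = "l", bty = "l",
     xlab = "Chlorophyll-a autumn", ylab = "Predicted probability")
```

With 200 points instead of millions, the plotting call has very little to draw, and predict() replaces the long hand-written sum of coefficient terms.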

Plot which parameter where in R?

So... I'm looking at an example in a book that goes something like this:
library(daewr)
mod1 <- aov(height ~ time, data=bread)
summary(mod1)
...
par(mfrow=c(2,2))
plot(mod1, which=5)
plot(mod1, which=1)
plot(mod1, which=2)
plot(residuals(mod1) ~ loaf, main="Residuals vs Exp. Units", font.main=1, data=bread)
abline(h = 0, lty = 2)
That all works... but the text is a little vague about the purpose of the parameter 'which='. I dug around in the help (in RStudio) on plot() and par(), looked around online... found some references to a different 'which()'... but nothing really referring me to the purpose/syntax for the parameter 'which=' inside plot().
A bit later (next page, figures) I found a mention of using names(mod1) to view the list of quantities calculated by aov... which I presume is what which= is referring to, i.e. which item in the list to plot where in the 2x2 matrix of plots. Yay. Now where the heck is that buried in the docs?!?
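The which argument does not index names(mod1); it selects which of the six diagnostic plots is displayed:
1. A plot of residuals against fitted values
2. A normal Q-Q plot
3. A Scale-Location plot of sqrt(| residuals |) against fitted values
4. A plot of Cook's distances versus row labels
5. A plot of residuals against leverages
6. A plot of Cook's distances against leverage/(1-leverage)
By default, plots 1, 2, 3 and 5 are provided.
Check ?plot.lm in R for more details. It is documented there rather than under ?plot because an aov object also has class "lm", so plot(mod1) dispatches to plot.lm.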
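A small sketch of how which= behaves, using the built-in PlantGrowth data as a stand-in for the book's bread data (an assumption, so that it runs without the daewr package):

```r
# One-way ANOVA on a built-in data set (stand-in for the bread example)
mod <- aov(weight ~ group, data = PlantGrowth)

# aov objects also carry class "lm", which is why plot() uses plot.lm
class(mod)

op <- par(mfrow = c(2, 2))   # 2x2 grid, filled row by row
plot(mod, which = 5)         # residuals vs leverage (or vs factor levels)
plot(mod, which = 1)         # residuals vs fitted values
plot(mod, which = 2)         # normal Q-Q plot
plot(mod, which = 4)         # Cook's distances
par(op)                      # restore previous graphics settings
```

The position in the 2x2 grid is determined by the order of the plot() calls, not by the value of which.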

lines() not properly displaying quadratic fit

I'm simply trying to display the fit I've generated using lm(), but the lines function is giving me a weird result in which there are multiple lines coming out of one point.
Here is my code:
library(ISLR)
data(Wage)
lm.mod<-lm(wage~poly(age, 4), data=Wage)
Wage$lm.fit<-predict(lm.mod, Wage)
plot(Wage$age, Wage$wage)
lines(Wage$age, Wage$lm.fit, col="blue")
I've tried resetting my plot with dev.off(), but I've had no luck. I'm using RStudio. FWIW, the line shows up perfectly fine if I make the regression linear only, but as soon as I make it quadratic or higher (using I(age^2) or poly()), I get a weird graph. Also, the points() function works fine with poly().
Thanks for the help.
Because you forgot to order the points by age first, lines() connects them in row order, so the segments jump back and forth between ages. This is happening for the linear regression too; the reason it looks fine there is that connecting any set of points on a straight line, in any order, stays on that line!
plot(Wage$age, Wage$wage)
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue')
Consider increasing the line width for a better view:
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue', lwd = 3)
Just to add another more general tip on plotting model predictions:
An often used strategy is to create a new data set (e.g. newdat) which contains a sequence of values for your predictor variables across a range of possible values. Then use this data to show your predicted values. In this data set, you have a good spread of predictor variable values, but this may not always be the case. With the new data set, you can ensure that your line represents evenly distributed values across the variable's range:
Example
newdat <- data.frame(age=seq(min(Wage$age), max(Wage$age),length=1000))
newdat$pred <- predict(lm.mod, newdata=newdat)
plot(Wage$age, Wage$wage, col=8, ylab="Wage", xlab="Age")
lines(newdat$age, newdat$pred, col="blue", lwd=2)

How to plot exponential function on barplot R?

So I have a barplot in which the y axis is the log (frequencies). From just eyeing it, it appears that the bars decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on this same graph. Thus, if my bars fall below the exponential, I would know that my bars do decrease either exponentially or faster than exponentially, and if the bars lie above the exponential, I would know that they don't decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit the density of an exponential distribution, you should probably plot a density histogram (not a frequency histogram), so that the bars and the fitted curve are on the same scale. See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
One way of visually inspecting whether two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when checking whether data follow a normal distribution.
The basic idea is to plot the quantiles of your data against some theoretical quantiles; if the data match that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same by plotting your data against the quantiles of the exponential distribution you think it follows.
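A minimal sketch of that last step, assuming simulated data in place of the asker's values. The rate is estimated as 1/mean(y), which is the maximum likelihood estimate for an exponential distribution.

```r
# Simulated data standing in for the real observations (assumption)
set.seed(1)
y <- rexp(500, rate = 3)

# Theoretical exponential quantiles at evenly spread probabilities
p        <- ppoints(length(y))
rate_hat <- 1 / mean(y)              # MLE of the exponential rate
theo     <- qexp(p, rate = rate_hat)

# Points close to the 1:1 line suggest the data are roughly exponential
plot(theo, sort(y),
     xlab = "Theoretical exponential quantiles", ylab = "Observed quantiles")
abline(0, 1, col = "red")
```

Systematic curvature away from the red line, especially in the upper tail, would suggest the data decay faster or slower than exponentially.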

Resources