I'm simply trying to display the fit I've generated using lm(), but the lines function is giving me a weird result in which there are multiple lines coming out of one point.
Here is my code:
library(ISLR)
data(Wage)
lm.mod<-lm(wage~poly(age, 4), data=Wage)
Wage$lm.fit<-predict(lm.mod, Wage)
plot(Wage$age, Wage$wage)
lines(Wage$age, Wage$lm.fit, col="blue")
I've tried resetting my plot with dev.off(), but I've had no luck. I'm using RStudio. FWIW, the line shows up perfectly fine if I make the regression linear only, but as soon as I make it quadratic or higher (using I(age^2) or poly()), I get a weird graph. Also, the points() function works fine with poly().
Thanks for the help.
Because you haven't ordered the points by age first, lines() connects them in the order they appear in the data, so the line jumps back and forth between ages. This happens for the linear regression too; the reason it looks fine there is that travelling between any set of points that lie on a straight line... stays on the line!
plot(Wage$age, Wage$wage)
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue')
Consider increasing the line width for a better view:
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue', lwd = 3)
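Equivalently, you can compute the ordering once and reuse it for both coordinates (a small variation on the same idea, using the model fit from the question):
ord <- order(Wage$age)
lines(Wage$age[ord], Wage$lm.fit[ord], col = "blue", lwd = 3)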
Just to add another more general tip on plotting model predictions:
An often used strategy is to create a new data set (e.g. newdat) which contains a sequence of values for your predictor variables across a range of possible values. Then use this data to show your predicted values. In this data set, you have a good spread of predictor variable values, but this may not always be the case. With the new data set, you can ensure that your line represents evenly distributed values across the variable's range:
Example
newdat <- data.frame(age=seq(min(Wage$age), max(Wage$age),length=1000))
newdat$pred <- predict(lm.mod, newdata=newdat)
plot(Wage$age, Wage$wage, col=8, ylab="Wage", xlab="Age")
lines(newdat$age, newdat$pred, col="blue", lwd=2)
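If you also want a confidence band around the fit (an optional extension, not part of the original tip), predict() can return interval bounds that plot nicely with matlines():
pred <- predict(lm.mod, newdata = newdat, interval = "confidence")  # columns: fit, lwr, upr
plot(Wage$age, Wage$wage, col = 8, ylab = "Wage", xlab = "Age")
matlines(newdat$age, pred, lty = c(1, 2, 2), col = "blue", lwd = c(2, 1, 1))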
I'm using GAM in R and I can't understand why the output for two different equations that should give the same plot are not exactly the same.
For example, when using the mpg dataset with a multivariate equation as follows, I get the plot for the additive effect of weight and rpm on hw.mpg. Then I want to see what happens when I plot the effect of rpm by fuel type. This gives me 3 plots, and I expected the first one (weight) to be exactly the same as the one plotted previously without the "by fuel" differentiation. Am I missing something? And what is graph 1 in figure 2 showing?
To get figure 1:
library(mgcv)  # gam() and plot.gam() come from the mgcv package
par(mfrow=c(1,2))
data(mpg)
mod_hwy1 <- gam(hw.mpg ~ s(weight) + s(rpm), data = mpg, method = "REML")
plot(mod_hwy1)
To get figure 2:
par(mfrow=c(1,3))
mod_hwy2 <- gam(hw.mpg ~ s(weight) + s(rpm, by=fuel), data = mpg, method = "REML")
plot(mod_hwy2)
With my own data it is even more visible that the two graphs are not exactly the same:
Please someone help me understand!
The main problem with your model is that you forgot to include the group means for the levels of fuel. As a result, the smooths, which are centred about the overall mean of the response, also have to model the group means for the levels of fuel.
Fit the model as:
mod_hwy2 <- gam(hw.mpg ~ fuel + # <--- group means
s(weight) + s(rpm, by=fuel),
data = mpg, method = "REML")
Then add in Gregor's point about these effects being conditional on the other terms in the model and you should be able to understand what's going on and why things change.
And regarding one of your comments: the locations are shown in your plot; look at the label on the y-axis of each panel.
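A minimal sketch of plotting the refitted model (assuming mgcv is loaded and mod_hwy2 has been refit with the fuel term as above; the all.terms and seWithMean arguments are standard plot.gam options, not something the answer prescribes):
library(mgcv)
# pages = 1 keeps all panels on one page; all.terms = TRUE also draws the
# parametric fuel effect; seWithMean = TRUE shows intervals that include the
# uncertainty about the overall mean
plot(mod_hwy2, pages = 1, all.terms = TRUE, shade = TRUE, seWithMean = TRUE)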
I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.
I am doing a linear regression with R.
In short the hypothesis is:
The more activity a member state shows, the more success it will have in negotiations.
I played around a lot with the data, transformed it etc.
What I have done so far:
# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")
# As I like to switch variables and test different models, I assign the relevant ones
# to objects x and y, so it is easier to change them later.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high
# I put the label for each observation in a factor lab
lab = linData$MS_short
# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)
# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)
So far so good. Now I want to check the model quality. For visual inspection I found out I can use the command
plot(linModel)
This produces 4 plots in a row:
Residuals vs Fitted
Normal Q-Q
Scale Location
Residuals vs Leverage
As you can see, in every picture R marks problematic observations with a number. It would be very convenient if R could just use the column "MS_short" from the dataset and add that label to the marked observations. I am sure this is possible... but how?
I have been working with R for 2 months now. I found some things here and via Google, but nothing helped me solve the problem. I have no one I can ask. This is my first post here on Stack Overflow.
Thank you in advance
Rainer
With the help of G. Grothendieck I solved the problem.
After opening the R help for plot, more specifically the help for plotting a linear regression (plot.lm), with the command
?plot.lm
I read the "Usage" and "Arguments" sections and identified the labels.id argument AND the id.n argument.
id.n is "number of points to be labelled in each plot, starting with the most extreme."
I needed that. I was interested in identifying these extreme points. R already marked the 3 most extreme points in all graphics (see initial post) but used the observation numbers and not any useful labels. Any other labelling would mess up the graphics. So, we remember: in my case I want the 3 most extreme values to be labelled.
Now let's add this to the command:
I started the same as above, with a plot of my already computed linear model -> plot(linModel). After that I added "id.n =" and set the value to 3. That looked like this:
plot(linModel, id.n = 3,
So far so good; now R knows what to label, BUT still not what should be used as the label.
For this we have to add the labels.id to the command.
labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
I assumed that a column in my dataset (NOT in the linear model!) has the properties of a vector, so I added a comma and then "labels.id =" to the command, followed by the name of my dataset and the column: in my case "linData$MS_short", where linData is the dataset and MS_short is the column with the 2-letter code for each member state. The final command looked like this:
plot(linModel, id.n = 3, labels.id = linData$MS_short)
And then it worked (see here). End of story.
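For anyone who wants a self-contained version to try, here is a minimal sketch with a built-in dataset (mtcars) instead of Rainer's data:
fit <- lm(mpg ~ wt, data = mtcars)
# label the 3 most extreme points with the car names instead of the row numbers
plot(fit, id.n = 3, labels.id = rownames(mtcars))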
Hope this can help some other newbies. Greetings.
I have performed a PCA using the prcomp function (base R), together with the factoextra package for plotting, on quite a substantial dataset of 3000 x 500.
I have tried plotting the main principal components that cover up to 100% of the cumulative variance proportion with an fviz_eig plot. However, this is a very large plot due to the large dimensions of the dataset. Is there any way in R to split a plot into multiple plots, using a for loop or any other way?
Here is a visual of my plot; it only covers 80% of the variance because the plot is so large. Could I split this plot into 2 plots?
Large Dataset Visualisation
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edit - reproducible example:
This is only a small reproducible example using a dataset already available in R, but I used a similar method for my large dataset. It shows how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # needed for fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have done what was suggested by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows...
Graph
But I am finding it difficult to get the labels to appear above each bar. Can someone help or suggest something for this, please?
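(One way to get the labels above the bars - a sketch, not taken from the answer below: barplot() invisibly returns the bar midpoints, so you can use those as the x coordinates for text() instead of hard-coding them.)
mids <- as.numeric(barplot(POEV[1:40], ylim = c(0, 0.22), main = "PCs 1 - 40"))
text(mids, POEV[1:40], labels = round(100 * POEV[1:40], 1), pos = 3, cex = 0.7)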
I am not 100% sure what you want as your result, but I am 100% sure that you need to take more control over what is being plotted, i.e. do more of it yourself. So let me show an example of doing that. The frets data that you used has only 4 dimensions so it is hard to illustrate what to do with more dimensions, so I will instead use the nuclear data - also available in the boot package. I am going to start by reproducing the type of graph that you displayed and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig plot that you displayed but has three main differences. First, it is showing the actual variances - not the percent of variance explained. Second, it does not contain the line that connects the tops of the bars. Third, it does not have the text labels that tell the heights of the boxes.
Percent of Variance Explained. The return from prcomp contains the raw information. str(N_PCA) shows that it has the standard deviations, not the variances - and we want the proportion of total variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot. Regarding the line, you can easily add that if you feel you need it, but I recommend against it. What does that line tell you that you can't already see from the barplot? If you are concerned about too much clutter obscuring the information, get rid of the line. But just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change scales (as I am about to do), it could be helpful for making comparisons.
OK, now that we have the substance of your original graph, it is easy to separate it into several parts. For my data, the first two bars are big so the rest are hard to see. In fact, PCs 5-11 show up as zero. Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4, it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you can find an appropriate way to group your components, you can zoom in on whichever PCs you want.
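A sketch of one way to automate that grouping with a loop (the chunk size and layout here are arbitrary choices, not part of the answer above):
chunk <- 4
groups <- split(seq_along(POEV), ceiling(seq_along(POEV) / chunk))
par(mfrow = c(1, length(groups)))
for (idx in groups) {
  # barplot() returns the bar midpoints, which we reuse as x positions for the labels
  mids <- as.numeric(barplot(POEV[idx], main = paste0("PC ", min(idx), "-", max(idx))))
  text(mids, POEV[idx], labels = round(100 * POEV[idx], 1), pos = 3, cex = 0.8)
}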
So I'm asked to obtain an estimate theta for the variable Length in the muscle dataset from the MASS package. The code I used is shown below, as well as the resulting curve. Somehow, I don't end up with a smooth curve, but with a very "blocky" one, as well as some lines between points on the curve. Can anyone help me get a smooth curve?
utils::data(muscle,package = "MASS")
Length.fit<-nls(Length~t1+t2*exp(-Conc/t3),muscle,
start=list(t1=3,t2=-3,t3=1))
plot(Length~Conc,data=muscle)
lines(muscle$Conc, predict(Length.fit))
[Image of the plot]
Edit: as a follow-up question:
If I want to predict the curve more accurately, I use nonlinear regression to predict the curve for each of the 21 strips. This gives me a vector
theta = (t1_1, ..., t1_21, t2_1, ..., t2_21, t3)
with 43 entries.
I can create a for-loop that plots all of the graphs, but like before, I end up with the blocky curve. However, seeing as I have to plot these curves as follows:
for(i in 1:21) {
lines(muscle$Conc,theta[i]+theta[i+21]*
exp(-muscle$Conc/theta[43]), col=color[i])
i = i+1
}
I don't see how I can use the same trick to smooth out these curves, since muscle$Conc still only has 4 values.
Edit 2:
I figured it out, and changed it to the following:
lines(seq(0,4,0.1),theta[i]+theta[i+21]*exp(-seq(0,4,0.1)/theta[43]), col=color[i])
If you look at the output of cbind(muscle$Conc, predict(Length.fit)), you'll see that many points are repeated and that they're not sorted in order of Conc. lines just plots the points in order and connects the points, giving you multiple back-and-forth lines. The code below runs predict on a unique set of ordered values for Conc.
plot(Length ~ Conc,data=muscle)
lines(seq(0,4,0.1),
predict(Length.fit, newdata=data.frame(Conc=seq(0,4,0.1))))
So... I'm looking at an example in a book that goes something like this:
library(daewr)
mod1 <- aov(height ~ time, data=bread)
summary(mod1)
...
par(mfrow=c(2,2))
plot(mod1, which=5)
plot(mod1, which=1)
plot(mod1, which=2)
plot(residuals(mod1) ~ loaf, main="Residuals vs Exp. Units", font.main=1, data=bread)
abline(h = 0, lty = 2)
That all works... but the text is a little vague about the purpose of the parameter 'which='. I dug around in the help (in RStudio) on plot() and par(), looked around online... found some references to a different 'which()'... but nothing really explaining the purpose/syntax of the 'which=' parameter inside plot().
A bit later (next page, figures) I found a mention of using names(mod1) to view the list of quantities calculated by aov... which I presume is what which= is referring to, i.e. which item in the list to plot where in the 2x2 matrix of plots. Yay. Now where the heck is that buried in the docs?!?
which selects which of the diagnostic plots to display:
1. A plot of residuals against fitted values
2. A normal Q-Q plot
3. A Scale-Location plot of sqrt(|residuals|) against fitted values
4. A plot of Cook's distances versus row labels
5. A plot of residuals against leverages
6. A plot of Cook's distances against leverage/(1 - leverage)
By default, plots 1-3 and 5 are provided.
Check ?plot.lm in R for more details.
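For example, to draw all six diagnostics at once (a sketch using a built-in dataset rather than the book's bread data):
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(2, 3))
plot(fit, which = 1:6)  # one panel per plot in the list above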