Fixing bug in geom_smooth method in R

I want to create a smooth curve plot for the data I have. The data are in a text file, say file.txt, which is tab-separated and has the headers A and B, like
A B
0.1 0.2
.....
.....
There are about 30000 such data points under both A and B
I am using the following code for that:
dstr_data <- read.table("file.txt", header=T, sep="\t")
ggplot(dstr_data, aes(xaxis)) +
  geom_smooth(method="auto", aes(y=dstr_data$A), colour="red", size=0.75) +
  geom_smooth(method="auto", aes(y=dstr_data$B), colour="darkgreen", alpha=0.5, size=0.75) +
  opts(title=expression("Test Plot"),
       panel.background = theme_rect(fill='blanchedalmond', colour='black')) +
  xlab("Data") + ylab("Values")
geom_smooth: method="auto" and size of largest group is >=1000,
so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
xaxis in my code holds numbers from 1 to 30000. So my X-axis would be numbers from 1 to 30000. Y-axis would be values from the file.txt. So, I am trying to plot two curves on one graph now.
I want to know why this error is displayed and how I can fix it. I want a method that gives me a smooth curve through the data rather than a straight line, so I do not want to use the lm or glm methods.
Also, I get the graph for only a subset of the data and not for the entire data set. Why does this happen?
Can someone help me with this? Thank you in advance.

Related

When using GAM in R, why is the plot for two continuous variables different from the first one plotted with two continuous by categorical?

I'm using GAM in R and I can't understand why the outputs of two different model formulas that should give the same plot are not exactly the same.
For example, when using the mpg dataset with a multivariate equation as follows, I get the plot for the additive effect of weight and rpm on hw.mpg. Then, I want to see what happens when I plot the effect of rpm by fuel type. This gives me 3 plots, and I expected the first one (weight) to be exactly the same as the one plotted previously without the "by fuel" differentiation. Am I missing something? And what is graph 1 in figure 2 showing?
To get figure 1:
par(mfrow=c(1,2))
data(mpg)
mod_hwy1 <- gam(hw.mpg ~ s(weight) + s(rpm), data = mpg, method = "REML")
plot(mod_hwy1)
To get figure 2:
par(mfrow=c(1,3))
mod_hwy2 <- gam(hw.mpg ~ s(weight) + s(rpm, by=fuel), data = mpg, method = "REML")
plot(mod_hwy2)
With my own data it is even more visible that the two graphs are not exactly the same:
Please someone help me understand!
The main problem with your model is that you forgot to include the group means for the levels of fuel. As a result, the smooths, which are centred about the overall mean of the response, also have to model the group means for the levels of fuel.
Fit the model as:
mod_hwy2 <- gam(hw.mpg ~ fuel + # <--- group means
s(weight) + s(rpm, by=fuel),
data = mpg, method = "REML")
Then add in Gregor's point about these effects being conditional upon the other terms in the model, and you should be able to understand what's going on and why things change.
And regarding one of your comments; the locations are shown in your plot, look at the label for the y-axis of each plot.
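To illustrate the point about group means, here is a minimal sketch on simulated data. The data frame dat and its values are hypothetical stand-ins for the mpg data in the question; only the two model formulas come from the thread. It assumes the mgcv package, which ships with standard R installations.

```r
library(mgcv)

set.seed(1)
n <- 300
# Hypothetical stand-in data: two fuel types with different mean response
dat <- data.frame(
  fuel   = factor(rep(c("diesel", "gas"), each = n / 2)),
  weight = runif(n, 1500, 4000),
  rpm    = runif(n, 4000, 6500)
)
dat$hw.mpg <- 40 - 0.004 * dat$weight +
  ifelse(dat$fuel == "diesel", 6, 0) +
  2 * sin(dat$rpm / 500) + rnorm(n)

# Without the factor main effect: the by-smooths have to absorb the group means
m_no_means <- gam(hw.mpg ~ s(weight) + s(rpm, by = fuel),
                  data = dat, method = "REML")

# With the factor main effect: the group means become parametric terms
m_means <- gam(hw.mpg ~ fuel + s(weight) + s(rpm, by = fuel),
               data = dat, method = "REML")

# nsdf counts the parametric (non-smooth) coefficients, intercept included
param_terms <- names(coef(m_means))[seq_len(m_means$nsdf)]
print(param_terms)
```

Calling plot() on each of the two fitted models, as in the question, should then show how the s(weight) panel changes once fuel is in the model.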

Linear regression with R: How do I get labels on data point in qq plot, scale location plot, Residuals vs Leverage etc

I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.
I am doing a linear regression with R.
In short the hypothesis is:
The more activity a member state shows, the more success it will have in negotiations.
I played around a lot with the data, transformed it etc.
What I have done so far:
# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")
# As I like to switch variables and test different models, I send the relevant ones to objects x and y.
# So it is easier for me to change it in the future.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high
# I put the label for each observation in a factor lab
lab = linData$MS_short
# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)
# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)
So far so good. Now I want to check the model quality. For visual inspection I found out I can use the command
plot(linModel)
This produces 4 plots in a row:
Residuals vs Fitted
Normal Q-Q
Scale Location
Residuals vs Leverage
As you can see, in every picture R marks problematic observations with a number. It would be very convenient if R could just use the column "MS_short" from the dataset and add that label to the marked observations. I am sure this is possible... but how?
I have been working with R for 2 months now. I found some material here and via Google, but nothing helped me solve the problem. I have no one I can ask. This is my first post here on Stack Overflow.
Thank you in advance
Rainer
With the help of G. Grothendieck I solved the problem.
After opening the R help for plot, more specifically the help for plotting linear regression models (plot.lm), with the command
?plot.lm
I read the box with the "arguments and usage" part and identified the labels.id argument AND the id.n argument.
id.n is "number of points to be labelled in each plot, starting with the most extreme."
I needed that. I was interested in identifying these extreme points. R had already marked the 3 most extreme points in all graphics (see initial post) but used the observation numbers and not any useful labels. Labelling all the other points as well would clutter the graphics. So, remember: in my case I want the 3 most extreme values to be labelled.
Now let's add this to the command:
I started the same way as above, with a plot of my already computed linear model -> plot(linModel). After that I added "id.n =" and set the value to "3". That looked like this:
plot(linModel, id.n = 3,
So far so good, now R knows what to label BUT still not what should be used as label.
For this we have to add the labels.id to the command.
labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
I assumed that one column in my dataset (NOT the linear model!) has the property of a vector and so I added a comma and then "labels.id =" to the command and typed in the name of my dataset and then the column, so in my case: "linData$MS_short" where linData is the dataset and MS_short the column with the 2 letter string for each member state. The final command looked like this:
plot(linModel, id.n = 3, labels.id = linData$MS_short)
And then it worked (see here). End of story.
Hope this can help some other newbies. Greetings.
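For anyone who wants to try this without the original CSV file, the following is a small self-contained sketch. The data frame and its values are made up for illustration; only the id.n and labels.id arguments of plot.lm are taken from the solution above.

```r
set.seed(42)
# Hypothetical data: activity and success scores for eight member states
linData <- data.frame(
  MS_short = c("AT", "BE", "DE", "FR", "IT", "NL", "PL", "SE"),
  activity = c(3, 5, 12, 10, 7, 6, 4, 8)
)
linData$success <- 0.5 * linData$activity + rnorm(nrow(linData))

linModel <- lm(success ~ activity, data = linData)

# Label the 3 most extreme points in each diagnostic plot with the state codes
par(mfrow = c(2, 2))
plot(linModel, id.n = 3, labels.id = linData$MS_short)
```

The label vector has to line up with the rows used in the fit, so if lm dropped rows with missing values, the labels would need to be subset in the same way.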

How to solve R linear regression graph problem?

I have a problem with the following graph:
For every value that I put in plot() I get this graph. Does anyone maybe know what it means?
cor.test works; I got a weak correlation.
My code:
cor.test(podatki$v54, podatki$v197, method = c("pearson"),
conf.level = 0.95, use = "all.obs" )
plot(podatki$v54, podatki$v197)
Your graph looks that way because the points are plotted directly on top of one another.
You can use the jitter(...) function to add small amounts of randomness to the data points so they aren't directly on top of one another (it jitters them around so you can see the ones underneath!) Here is an example you can copy and paste:
# create some random numbers to plot. all are values 1-5.
x1 <- sample(c(1:5), 100, replace = TRUE)
x2 <- sample(c(1:5), 100, replace = TRUE)
# plotting without jitter
plot(x1, x2)
# plotting with jitter
plot(jitter(x1), jitter(x2))
jitter(...) changes the values by small amounts, so only use the jittered data for plotting; otherwise it will bias your results!
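As a rough sketch of that caution, the example below keeps the jittered copies separate from the original values. The variable names x1_j and x2_j are made up for this illustration.

```r
set.seed(123)
x1 <- sample(1:5, 100, replace = TRUE)
x2 <- sample(1:5, 100, replace = TRUE)

# Jittered copies are for display only
x1_j <- jitter(x1)
x2_j <- jitter(x2)
plot(x1_j, x2_j)

# Statistics are computed on the original, unjittered values
r_original <- cor(x1, x2)
r_jittered <- cor(x1_j, x2_j)   # differs slightly because of the added noise
c(original = r_original, jittered = r_jittered)
```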

Fit multiple sine waves with linear regression (R)

I'm trying to fit multiple sine waves (three to be precise) to data using the lm function in R. I am able to get a result, but it looks far from correct:
The green line is kinda wobbly, as it should be, but it seems to be only a single sine (with parabola added), and doesn't match the data very well. What am I doing wrong?
The code I used: (The periods are hardcoded, as they were given to me. Also, timetopdh is the time in seconds, heightopdh the water level at a certain point in time.)
plot(timetopdh,heightopdh,xlim=topvector, ylim = c(0,270))
period1 <- 545
period2 <- 205
period3 <- 85
sin11 <- sin(2*pi/period1*timetopdh)
sin12 <- cos(2*pi/period1*timetopdh)
sin21 <- sin(2*pi/period2*timetopdh)
sin22 <- cos(2*pi/period2*timetopdh)
sin31 <- sin(2*pi/period3*timetopdh)
sin32 <- cos(2*pi/period3*timetopdh)
lmsinus <- lm(heightopdh ~ poly(timetopdh,2) + sin11 + sin12 + sin21 + sin22 + sin31 + sin32)
fitted_for_lines <- fitted(lmsinus)
pred <- predict(lmsinus, newdata=data.frame(timetopdh = timetopdh))
lines(timetopdh, pred, col=3, lwd=3)
That is pretty well matched; it should not follow the data exactly. A linear model is supposed to find the best fit it can (classically a straight line), reducing the unexplained variation in Y as much as possible. The green line does exactly that for you (though it is not straight).
You have only one line because you asked for only one line. The time and height were plotted point by point against each other, in the order in which they were stored and passed to plot.
The line was drawn from the prediction using the supplied data, but the points were drawn from the other two vectors, timetopdh and heightopdh. If you want the individual sine waves drawn, you need to request each one separately, using an appropriate graphing method for them, usually a line graph.
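One possible way to draw each wave separately is sketched below on simulated data. The vectors time_s and level are hypothetical stand-ins for timetopdh and heightopdh, the three periods are the ones from the question, and the poly(timetopdh, 2) trend term is left out to keep the sketch short.

```r
set.seed(1)
# Hypothetical data: time in seconds and a water level built from three sinusoids
time_s  <- seq(0, 2000, by = 5)
periods <- c(545, 205, 85)
level   <- 130 + 40 * sin(2 * pi / periods[1] * time_s) +
                 20 * cos(2 * pi / periods[2] * time_s) +
                 10 * sin(2 * pi / periods[3] * time_s) +
                 rnorm(length(time_s), sd = 5)

# One sine and one cosine regressor per period
X <- do.call(cbind, lapply(periods, function(p)
  cbind(sin(2 * pi / p * time_s), cos(2 * pi / p * time_s))))
colnames(X) <- paste0(rep(c("sin", "cos"), times = 3), rep(1:3, each = 2))
dat <- data.frame(level = level, X)

fit <- lm(level ~ ., data = dat)

# Contribution of each period = its sine term plus its cosine term
b <- coef(fit)
components <- sapply(1:3, function(i)
  b[paste0("sin", i)] * dat[[paste0("sin", i)]] +
  b[paste0("cos", i)] * dat[[paste0("cos", i)]])

plot(time_s, level, pch = ".", col = "grey40")
lines(time_s, fitted(fit), col = 3, lwd = 2)   # combined fit
for (i in 1:3) {
  # each wave, shifted by the intercept so it sits on the data
  lines(time_s, b["(Intercept)"] + components[, i], col = i + 3)
}
```

Because the model is linear in the sine and cosine terms, the combined fit is just the intercept plus the sum of the three component curves.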

Uniform plot points in R -- Research / HW

This is for research I am doing for my Masters Program in Public Health
I am graphing two variables against each other, a standard x,y type deal, and on top of that I am plotting a predicted line. I get what I think is the most funky-looking point/boxplot thing ever, with an x-axis that is only half filled out. I don't understand why, as I do not call a boxplot function. It is my understanding that when I call the plot function, only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol))  # produces goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not boxplots
Here is the resulting picture
I am not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set at this link here. I do not understand why boxplots are being drawn: I did not call the boxplot() command, so I don't know why they appeared along with the point plots. When I issue the command plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot, as I expected, but when I swap what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
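The boxplot rules described above can be checked numerically. The sketch below uses a made-up vector x in which more than 75% of the values are 0, and relies on boxplot.stats from base R, whose default whisker multiplier is coef = 1.5.

```r
# Hypothetical sample in which more than 75% of the values are 0
x <- c(rep(0, 16), 3, 8, 14, 20)

quantile(x, c(0.25, 0.5, 0.75))   # all three quartiles are 0, so the box collapses

s <- boxplot.stats(x)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # values beyond the whiskers, drawn as individual points
```

With an inter-quartile range of 0 the whiskers have zero length, so every non-zero value ends up in s$out and is plotted as a separate point.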
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, the variable on the x-axis (TOTAL.LACE) has been converted to a factor without your meaning to, since plot() draws boxplots when x is a factor and y is numeric. Presumably this is connected to the pch=unique(psymbol) argument and the code psymbol <- unique(FAILURE + 1) above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.
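If it is unclear which variable ended up as a factor, a quick check along the following lines may help. The values here are made up, and lace_num is a hypothetical name for the converted copy.

```r
# Hypothetical data mimicking the question: TOTAL.LACE accidentally stored as a factor
TOTAL.LACE      <- factor(c(9, 16, 4, 12, 9, 0))
DAYS.TO.FAILURE <- c(15, 7, 30, 2, 21, 11)

class(TOTAL.LACE)   # "factor": plot(TOTAL.LACE, DAYS.TO.FAILURE) would draw boxplots

# Convert via character so the original values, not the level codes, are recovered
lace_num <- as.numeric(as.character(TOTAL.LACE))

plot(lace_num, DAYS.TO.FAILURE)   # numeric x and y: an ordinary scatterplot
```

Going through as.character() matters here, because as.numeric() applied directly to a factor returns the internal level codes rather than the original numbers.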
