I'm trying to fit multiple sine waves (three to be precise) to data using the lm function in R. I am able to get a result, but it looks far from correct:
The green line is kinda wobbly, as it should be, but it seems to be only a single sine (with parabola added), and doesn't match the data very well. What am I doing wrong?
The code I used: (The periods are hardcoded, as they were given to me. Also, timetopdh is the time in seconds, heightdh the water level at a certain point in time.)
plot(timetopdh,heightopdh,xlim=topvector, ylim = c(0,270))
period1 <- 545
period2 <- 205
period3 <- 85
sin11 <- sin(2*pi/period1*timetopdh)
sin12 <- cos(2*pi/period1*timetopdh)
sin21 <- sin(2*pi/period2*timetopdh)
sin22 <- cos(2*pi/period2*timetopdh)
sin31 <- sin(2*pi/period3*timetopdh)
sin32 <- cos(2*pi/period3*timetopdh)
lmsinus <- lm(heightopdh ~ poly(timetopdh,2) + sin11 + sin12 + sin21 + sin22 + sin31 + sin32)
fitted_for_lines <- fitted(lmsinus)
pred <- predict(lmsinus, newdata=data.frame(timetopdh = timetopdh))
lines(timetopdh, pred, col=3, lwd=3)
That is pretty well matched, it should not follow the data exactly. A linear model is supposed to create the best fit (of a straight line) that it can which reduces the variation in Y as much as it can. The green line does exactly that for you (though not straight)
You have only one line because you asked for only one line. The time and height were plotted point per point against each other in the order of the list they were stored in and passed into plot.
The line was drawn from the prediction using the supplied data, but the points were drawn from those other two sets; timetopdh,heightopdh. If you had wanted the sinus waves to print, then you need to ask for each one but name using an appropriate graphing method for them, usually a line graph.
Related
I'm trying plots a graph lines using ggplot library in R, but I get a good plots but I need reduce the gradual space or height between rows grid lines because I get big separation between lines.
This is my R script:
library(ggplot2)
library(reshape2)
data <- read.csv('/Users/keepo/Desktop/G.Con/Int18/input-int18.csv')
chart_data <- melt(data, id='NRO')
names(chart_data) <- c('NRO', 'leyenda', 'DTF')
ggplot() +
geom_line(data = chart_data, aes(x = NRO, y = DTF, color = leyenda), size = 1)+
xlab("iteraciones") +
ylab("valores")
and this is my actual graphs:
..the first line is very distant from the second. How I can reduce heigth?
regards.
The lines are far apart because the values of the variable plotted on the y-axis are far apart. If you need them closer together, you fundamentally have 3 options:
change the scale (e.g. convert the plot to a log scale), although this can make it harder for people to interpret the numbers. This can also change the behavior of each line, not just change the space between the lines. I'm guessing this isn't what you will want, ultimately.
normalize the data. If the actual value of the variable on the y-axis isn't important, just standardize the data (separately for each value of leyenda).
As stated above, you can graph each line separately. The main drawback here is that you need 3 graphs where 1 might do.
Not recommended:
I know that some graphs will have the a "squiggle" to change scales or skip space. Generally, this is considered poor practice (and I doubt it's an option in ggplot2 because it masks the true separation between the data points. If you really do want a gap, I would look at this post: axis.break and ggplot2 or gap.plot? plot may be too complexe
In a nutshell, the answer here depends on what your numbers mean. What is the story you are trying to tell? Is the important feature of your plots the change between them (in which case, normalizing might be your best option), or the actual numbers themselves (in which case, the space is relevant).
you could use an axis transformation that maps your data to the screen in a non-linear fashion,
fun_trans <- function(x){
d <- data.frame(x=c(800, 2500, 3100), y=c(800,1950, 3100))
model1 <- lm(y~poly(x,2), data=d)
model2 <- lm(x~poly(y,2), data=d)
scales::trans_new("fun",
function(x) as.vector(predict(model1,data.frame(x=x))),
function(x) as.vector(predict(model2,data.frame(y=x))))
}
last_plot() + scale_y_continuous(trans = "fun")
enter image description here
So, I am challenged and request a little guidance.
I have used the rriskDistributions package to evaluate some CDFs for some industrial sector injury data with the get.lnorm.par() function. It fits the data great, unfortunately, the axes require swapping because my response variable is currently on the x-axis, and needs to be on the y-axis. Unfortunately again, the get.lnorm.par() function requires that the probabilities be only on the y-axis, and I cannot figure out how to create the same curve with swapped axes.
I want to get it to look something like this:
An example of the code that I have worked through in ggplot follows:
x <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
y <- c(3170221,6810103,14999840,26623982,48903587,74177290,266181110)
prob <- c(x) ## There are 389 different x values, but keeping it simple!
quant <- c(y) ## Same as x.
df1 <- data.frame(prob,quant)
plot2 <- ggplot(df1, aes(x=prob, y=quant)) + geom_point() +
geom_smooth(method="lm", formula= log(y)~x, se=FALSE) +
labs(y="quantiles", x="probabilities", title="Probs vs Quants")
plot2
I have created lines that fit this data, but everything ends at the last data point.
When I used get.lnorm.par(), the fit was great, but like stated previously, the axes require flipping. When I tried this, I continued to get errors about infinite output and could not define the bounds of the function to be plotted.
So, here is the code using the rriskDistributions package:
pct <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
my.lnorm<-get.lnorm.par(p=pct, q=c(3170221,6810103,14999840,26623982,48903587,74177290,266181110),
tol = 0.001, scaleX = c(0,0.0809))
Essentially, I am trying to create a fit curve for the data (either exponential or power) that expands, or predicts beyond the final data point. This I cannot figure out for the life of me, and changing any of the parameters in the rriskDistributions functions is quite challenging.
Any thoughts?
Thanks.
I'm simply trying to display the fit I've generated using lm(), but the lines function is giving me a weird result in which there are multiple lines coming out of one point.
Here is my code:
library(ISLR)
data(Wage)
lm.mod<-lm(wage~poly(age, 4), data=Wage)
Wage$lm.fit<-predict(lm.mod, Wage)
plot(Wage$age, Wage$wage)
lines(Wage$age, Wage$lm.fit, col="blue")
I've tried resetting my plot with dev.off(), but I've had no luck. I'm using rStudio. FWIW, the line shows up perfectly fine if I make the regression linear only, but as soon as I make it quadratic or higher (using I(age^2) or poly()), I get a weird graph. Also, the points() function works fine with poly().
Thanks for the help.
Because you forgot to order the points by age first, the lines are going to random ages. This is happening for the linear regression too; he reason it works for lines is because traveling along any set of points along a line...stays on the line!
plot(Wage$age, Wage$wage)
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue')
Consider increasing the line width for a better view:
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue', lwd = 3)
Just to add another more general tip on plotting model predictions:
An often used strategy is to create a new data set (e.g. newdat) which contains a sequence of values for your predictor variables across a range of possible values. Then use this data to show your predicted values. In this data set, you have a good spread of predictor variable values, but this may not always be the case. With the new data set, you can ensure that your line represents evenly distributed values across the variable's range:
Example
newdat <- data.frame(age=seq(min(Wage$age), max(Wage$age),length=1000))
newdat$pred <- predict(lm.mod, newdata=newdat)
plot(Wage$age, Wage$wage, col=8, ylab="Wage", xlab="Age")
lines(newdat$age, newdat$pred, col="blue", lwd=2)
This is for research I am doing for my Masters Program in Public Health
I am graphing data against each other, a standard x,y type deal, over top of that I am plotting a predicted line. I get what I think to be the most funky looking point/boxplot looking thing ever with an x axis that is half filled out and I don't understand why as I do not call a boxplot function. When I call the plot function it is my understanding that only the points will plot.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response") <-- produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol)) <-- Produces goofy graph
lines(tl_ord, pred_ord) <-- this produces the line not boxplots
Here is the resulting picture
Not to sure how to proceed from here, this is an off shoot of another problem I had with the same data set at this link here I am not understanding why boxplots are being drawn, the reason being is I did not specifically call the boxplot() command so I don't know why they appeared along with point plots. When I issue the following command: plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot like I expected, but when I change the order of what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem as pointed out by #Dwin et all Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, you have converted DAYS.TO.FAILURE to a factor without meaning to. Presumably this was done in the pch=unique(psymbol) argument via the code psymbol <- unique(FAILURE + 1) above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.
I've obtained a survival plot from the following code:
s = Surv(outcome.[,1], outcome.[,2])
survplot= (survfit(s ~ person.list[,1]))
plot(survplot, mark.time = FALSE)
person.list is just a list of 15 people.
When I plot this, the lines on my plot all end at different time points. Is there a way to extend all the lines to make them end at a certain time point? (i.e outcome.[,1] is a time to event variable and I would like the survival lines on the plot to extend out to say 5(years) )
Thanks,
Matt
This isn't an answer of how to do what you ask, but rather an explanation of why you should not do what you ask.
The lines stop where the data stops. Beyond that time, you have no information in order to make an estimate of the survival (this is in a traditional Kaplan-Meier survival analysis, as you have set it up). Therefore, the Kaplan-Meier estimate is not well defined beyond that time, and so extending that curve does not have any particular meaning. While graphically you could just draw a horizontal line at the same level as the last survival value, this is not really meaningful.
This is code I posted to a similar question on rhelp a while ago:
http://finzi.psych.upenn.edu/Rhelp10/2010-September/253817.html
?survfit # to get a working example since you did not provide one
lsurv2 <- survfit(Surv(time, status) ~ x, aml, type='fleming')
plot(lsurv2, lty=2:3, xmax=300) # drats, no effect of xmax
str(lsurv2) # so see the structure of the survfit object
lsurv2$time[21] <- 300 #add a time value
lsurv2$n.censor[21] <- 1 # mark as censoring time
lsurv2$strata[2] <- 11 # add to count of group 2
plot(lsurv2, lty=2:3, xmax=300) # horizontal line to 300 for group 2
And this was Therneau's later response (presumably better than mine): http://finzi.psych.upenn.edu/Rhelp10/2010-September/253879.html
plot(surv, mark.time=F, fun='event', xlim=c(0, 54))
for (i in 1:length(surv$strata)) { #number of curves
temp <- surv[i]
lines(c(max(temp$time), 54), 1- rep(min(temp$surv),2))
}