I'm trying to perform a simple linear regression on a dataset that uses a date range as the independent variable. I want to plot the data along with the regression line and equation. The dataset is not a time series. When I create the plot, it looks like
this.
The plot looks fine; however, the slope is clearly not zero and the intercept should not be 0.945 as the equation states. This article gives a good explanation of how the programming language shifts the origin of the date range to a pre-specified zero starting point, which I believe is 1/1/1970 in R. I think this is what is happening in my case. While the article explains the problem well, its solution doesn't detail how to fix it in R. So my question is: how do I shift the origin of my independent variable from 1/1/1970 to the first date in my dataset when performing the linear regression in R?
I've tried converting the date range to both numeric and factor, neither of which was the solution. I suspect that I'm not using the right search terms when searching for solutions on the web. Most of what comes up describes how to force the regression line through the origin, which is not what I want to do here. Thanks for the help.
EDIT: My code to produce the plot is below. 'agcy' is a sample of the data. The actual dataset has more than a thousand points.
agcy <- data.frame(as.Date(c('2010-01-01', '2011-02-01', '2012-11-18', '2016-08-30', '2017-04-21')), c(-0.3, -0.1, -0.1, -0.2, -0.4))
colnames(agcy) <- c('Date', 'Diff')
png('C:\\Desktop\\file.png', width = 720, height = 480)
samps <- 0.05
MA <- movAvg(agcy, agcy$Diff, samps) #movAvg() is a user-defined function that computes the moving average of the data series, "samps" is the proportion of data points to use in the moving average calculation
model <- lm(MA ~ agcy$Date)
intercept <- round(coef(model)[1],3)
slope <- round(coef(model)[2],3)
r2 <- round(summary(model)$r.squared,3)
eqn <- sapply(c(bquote(italic(y) == .(slope)*italic(x) + .(intercept)),bquote(r^2 == .(r2))), as.expression)
plot(agcy$Date, MA, xlab='Date',ylab='Difference',col = 'green', type = 'p', pch=18, ylim = c(-0.5,0.5))
abline(model)
text(par('usr')[2],c(0.5,0.45),eqn,pos = 2)
abline(h=0, col = 'black')
dev.off()
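One way to move the origin, sketched here on the sample data, is to regress on the number of days since the first date. This is only an illustration under assumptions: it uses Diff directly because movAvg() is not shown, and the Days column and the model names are made up for the example. The slope stays the same, while the intercept becomes the fitted value on the first date. A per-day slope is also very small, so round(..., 3) can print it as 0; signif() keeps the leading digits instead.

```r
agcy <- data.frame(
  Date = as.Date(c("2010-01-01", "2011-02-01", "2012-11-18",
                   "2016-08-30", "2017-04-21")),
  Diff = c(-0.3, -0.1, -0.1, -0.2, -0.4)
)

# Days elapsed since the first observation (hypothetical helper column).
agcy$Days <- as.numeric(agcy$Date - min(agcy$Date))

# The same regression with two different origins for x.
model_1970  <- lm(Diff ~ as.numeric(Date), data = agcy)  # origin at 1970-01-01
model_first <- lm(Diff ~ Days, data = agcy)              # origin at the first date

# signif() keeps significant digits, so a tiny per-day slope is not shown as 0.
slope     <- signif(coef(model_first)[["Days"]], 3)
intercept <- signif(coef(model_first)[["(Intercept)"]], 3)
```

To keep dates on the x-axis, the plot can still use agcy$Date; only the model and the printed equation need the shifted variable.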
I have a problem with the following graph:
Whatever values I pass to plot(), I get this graph. Does anyone know what it means?
cor.test works; I got a weak correlation.
My code:
cor.test(podatki$v54, podatki$v197, method = c("pearson"),
conf.level = 0.95, use = "all.obs" )
plot(podatki$v54, podatki$v197)
Your graph looks that way because the points are plotted directly on top of one another.
You can use the jitter(...) function to add small amounts of randomness to the data points so they aren't directly on top of one another (it jitters them around so you can see the ones underneath!) Here is an example you can copy and paste:
# create some random numbers to plot. all are values 1-5.
x1 <- sample(c(1:5), 100, replace = TRUE)
x2 <- sample(c(1:5), 100, replace = TRUE)
# plotting without jitter
plot(x1, x2)
# plotting with jitter
plot(jitter(x1), jitter(x2))
jitter(...) changes the values by small amounts, so use the jittered data only for plotting; otherwise it will bias your results!
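A small sketch of that advice, with made-up variable names: the statistics are computed on the original values, and the jittered copies exist only for the plot.

```r
set.seed(1)
x1 <- sample(1:5, 100, replace = TRUE)
x2 <- sample(1:5, 100, replace = TRUE)

# Analysis on the original, unmodified values.
r_raw <- cor(x1, x2)

# Jittered copies are used for display only.
x1_j <- jitter(x1)
x2_j <- jitter(x2)
plot(x1_j, x2_j, xlab = "x1", ylab = "x2")

# For comparison: the correlation of the jittered copies is usually slightly off.
r_jit <- cor(x1_j, x2_j)
```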
So, I am challenged and request a little guidance.
I have used the rriskDistributions package to evaluate some CDFs for some industrial sector injury data with the get.lnorm.par() function. It fits the data well. Unfortunately, the axes need swapping, because my response variable is currently on the x-axis and needs to be on the y-axis. Unfortunately again, the get.lnorm.par() function requires that the probabilities be only on the y-axis, and I cannot figure out how to create the same curve with swapped axes.
I want to get it to look something like this:
An example of the code that I have worked through in ggplot follows:
x <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
y <- c(3170221,6810103,14999840,26623982,48903587,74177290,266181110)
prob <- c(x) ## There are 389 different x values, but keeping it simple!
quant <- c(y) ## Same as x.
df1 <- data.frame(prob,quant)
plot2 <- ggplot(df1, aes(x=prob, y=quant)) + geom_point() +
geom_smooth(method="lm", formula= log(y)~x, se=FALSE) +
labs(y="quantiles", x="probabilities", title="Probs vs Quants")
plot2
I have created lines that fit this data, but everything ends at the last data point.
When I used get.lnorm.par(), the fit was great but, as stated previously, the axes require flipping. When I tried this, I kept getting errors about infinite output and could not define the bounds of the function to be plotted.
So, here is the code using the rriskDistributions package:
pct <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
my.lnorm<-get.lnorm.par(p=pct, q=c(3170221,6810103,14999840,26623982,48903587,74177290,266181110),
tol = 0.001, scaleX = c(0,0.0809))
Essentially, I am trying to create a fit curve for the data (either exponential or power) that expands, or predicts beyond the final data point. This I cannot figure out for the life of me, and changing any of the parameters in the rriskDistributions functions is quite challenging.
Any thoughts?
Thanks.
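If the goal is mainly a curve that continues past the last data point, one option is to fit the model with lm() and predict on a wider grid. This is a sketch, not the rriskDistributions fit: it assumes an exponential relationship (log(quant) linear in prob), and the upper limit of 0.35 and the name grid are arbitrary choices for illustration.

```r
prob  <- c(0.0416988, 0.0656371, 0.1015444, 0.1270270,
           0.1536680, 0.1694981, 0.2509653)
quant <- c(3170221, 6810103, 14999840, 26623982,
           48903587, 74177290, 266181110)

# Exponential model: log(quant) is assumed linear in prob.
fit <- lm(log(quant) ~ prob)

# Predict on a grid that extends beyond the observed range of prob.
grid <- data.frame(prob = seq(0, 0.35, length.out = 200))
grid$quant <- exp(predict(fit, newdata = grid))

# Data plus the extrapolated curve, with a log-scaled y-axis.
plot(prob, quant, log = "y",
     xlim = range(grid$prob), ylim = range(c(quant, grid$quant)),
     xlab = "probabilities", ylab = "quantiles")
lines(grid$prob, grid$quant, col = "blue")
```

Swapping the arguments in plot() and lines() puts either variable on the y-axis without refitting.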
I'm trying to fit multiple sine waves (three to be precise) to data using the lm function in R. I am able to get a result, but it looks far from correct:
The green line is kinda wobbly, as it should be, but it seems to be only a single sine (with parabola added), and doesn't match the data very well. What am I doing wrong?
The code I used: (The periods are hardcoded, as they were given to me. Also, timetopdh is the time in seconds and heightopdh is the water level at a certain point in time.)
plot(timetopdh,heightopdh,xlim=topvector, ylim = c(0,270))
period1 <- 545
period2 <- 205
period3 <- 85
sin11 <- sin(2*pi/period1*timetopdh)
sin12 <- cos(2*pi/period1*timetopdh)
sin21 <- sin(2*pi/period2*timetopdh)
sin22 <- cos(2*pi/period2*timetopdh)
sin31 <- sin(2*pi/period3*timetopdh)
sin32 <- cos(2*pi/period3*timetopdh)
lmsinus <- lm(heightopdh ~ poly(timetopdh,2) + sin11 + sin12 + sin21 + sin22 + sin31 + sin32)
fitted_for_lines <- fitted(lmsinus)
pred <- predict(lmsinus, newdata=data.frame(timetopdh = timetopdh))
lines(timetopdh, pred, col=3, lwd=3)
That is actually a reasonable match; the fit should not follow the data exactly. A linear model finds the best least-squares fit for the terms you give it, reducing the unexplained variation in Y as much as it can. The green line does exactly that for you (though it is not straight, because the model includes sine, cosine and polynomial terms).
You have only one line because you asked for only one line. The time and height were plotted point by point against each other, in the order in which they were stored and passed to plot.
The line was drawn from the prediction on the supplied data, whereas the points were drawn from the two original vectors, timetopdh and heightopdh. If you want the individual sine waves drawn, you need to plot each one explicitly, using an appropriate graph type for them, usually a line graph.
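For reference, here is a self-contained sketch of fitting three sines with lm(). The data are synthetic (the original timetopdh and heightopdh are not available), the names time_s, height, d and new are assumptions, and the polynomial term is left out for brevity. Writing the sine and cosine terms inside the formula lets predict() rebuild them from newdata, so the fitted curve can be drawn on a finer grid than the observations.

```r
set.seed(42)
time_s <- seq(0, 2000, by = 5)

# Synthetic water level: three sines with the given periods plus noise.
height <- 130 +
  40 * sin(2 * pi * time_s / 545) +
  20 * cos(2 * pi * time_s / 205) +
  10 * sin(2 * pi * time_s / 85) +
  rnorm(length(time_s), sd = 3)
d <- data.frame(time_s = time_s, height = height)

# One sine/cosine pair per period; lm() estimates amplitude and phase via the pair.
fit <- lm(height ~ sin(2 * pi * time_s / 545) + cos(2 * pi * time_s / 545) +
                   sin(2 * pi * time_s / 205) + cos(2 * pi * time_s / 205) +
                   sin(2 * pi * time_s / 85)  + cos(2 * pi * time_s / 85),
          data = d)

# Predict on a finer grid; the trig terms are recomputed from newdata.
new <- data.frame(time_s = seq(0, 2000, by = 1))
new$height <- predict(fit, newdata = new)

plot(d$time_s, d$height, xlab = "time (s)", ylab = "height")
lines(new$time_s, new$height, col = 3, lwd = 3)

# Amplitude of the first component, recovered from its sine/cosine pair.
amp1 <- sqrt(sum(coef(fit)[2:3]^2))
```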
I am trying to smooth my data set using a kernel or loess smoothing method, but the results are either unclear or not what I want. My questions are as follows.
My x data is "conc" and my y data is "depth" (measured in, e.g., cm).
1) Kernel smooth
k <- kernel("daniell", 150)
plot(k)
K <- kernapply(conc, k)
plot(conc~depth)
lines(K, col = "red")
Here, my data is smoothed with frequency=150. Does this mean that every data point is averaged with its 150 neighboring data points on the left and on the right? What does "daniell" mean? I could not find an explanation online.
2) Loess smooth
p<-qplot(depth, conc, data=total)
p1 <- p + geom_smooth(method = "loess", size = 1, level=0.95)
Here, what are the defaults of the loess smoothing function? If I want to smooth my data with frequency=150 as in the case above (a moving average over every 150 data points), how can I modify this code?
3) To show the y-axis on a log scale, I used "log10(conc)" instead of "conc", and it worked. But I cannot change the y-axis tick labels. I tried adding "scale_y_log10(limits = c(1,1e3))" to my code to show tick labels like 10^0, 10^1, 10^2..., but it did not work.
Please answer my questions. Thanks a lot for your help.
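On question 1, a small sketch with synthetic data (the depth and conc values below are invented, and m = 10 is used instead of 150 to keep the example short). In stats::kernel(), "daniell" with parameter m is a plain moving average with 2*m + 1 equal weights, so each point is averaged with m neighbours on each side. kernapply() drops m points at each end, so the smoothed values have to be plotted against the matching depths, not against their index.

```r
set.seed(1)
n     <- 500
depth <- seq(1, n)                                   # hypothetical depths
conc  <- 50 + 10 * sin(depth / 40) + rnorm(n, sd = 3)

m <- 10
k <- kernel("daniell", m)                            # 2*m + 1 equal weights

smoothed  <- kernapply(conc, k)                      # length n - 2*m, ends dropped
depth_mid <- depth[(m + 1):(n - m)]                  # depths matching smoothed values

# The same result written as a centered moving average over 2*m + 1 points.
ma <- as.numeric(stats::filter(conc, rep(1 / (2 * m + 1), 2 * m + 1), sides = 2))

plot(depth, conc, col = "grey")
lines(depth_mid, smoothed, col = "red", lwd = 2)
```

For loess, the span argument is the fraction of points used in each local fit (0.75 by default), so a span of roughly (2*m + 1)/n would be the nearest analogue of this window, although loess fits local polynomials and does not compute a plain average.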
I have a probability density function in a plot called ph that I derived from two samples of data, with the help of a Stack Overflow user, in this way:
few <-read.table('outcome.dat',head=TRUE)
many<-read.table('alldata.dat',head=TRUE)
mh <- hist(many$G,breaks=seq(0,1.,by=0.03), plot=FALSE)
fh <- hist(few$G, breaks=mh$breaks, plot=FALSE)
ph <- fh
ph$density <- fh$counts/(mh$counts+0.001)
plot(ph,freq=FALSE,col="blue")
I would like to fit the best curve to the plot of ph, but I can't find a working method.
How can I do this? Do I have to extract the values from ph and then work on them, or is there some function that works on
plot(ph,freq=FALSE,col="blue")
directly?
Assuming you mean that you want to perform a curve fit to the data in ph, then something along the lines of
nls(y ~ FUN(x, ...), data = data.frame(x = ph$mids, y = ph$counts), start = ...) may work. You need to know what sort of function 'FUN' you think the histogram data should fit, e.g. a normal distribution. Read the help file on nls() to learn how to set up starting "guess" values for the coefficients in FUN.
If you simply want to overlay a curve onto the histogram, then
smoo <- spline(ph$mids, ph$counts)
lines(smoo$x, smoo$y)
will come close to doing that. You may have to adjust the x and/or y scaling.
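As a concrete sketch of the nls() route, the example below fits a normal density to histogram bin midpoints. Everything in it is an assumption for illustration: the data are simulated, because outcome.dat and alldata.dat are not available, the normal shape is only one possible choice of curve, and the names d, fit and grid are made up.

```r
set.seed(1)
x <- rnorm(1000, mean = 0.5, sd = 0.1)   # synthetic stand-in for the real data

# Draw the histogram on the density scale and keep its bin summaries.
h <- hist(x, breaks = 30, freq = FALSE, col = "blue")
d <- data.frame(mid = h$mids, dens = h$density)

# Fit a normal curve to (midpoint, density) pairs; start values come from the sample.
fit <- nls(dens ~ dnorm(mid, mean = mu, sd = sigma),
           data  = d,
           start = list(mu = mean(x), sigma = sd(x)))

# Overlay the fitted curve on the histogram.
grid <- seq(min(h$breaks), max(h$breaks), length.out = 200)
lines(grid, predict(fit, newdata = data.frame(mid = grid)), col = "red", lwd = 2)
```

With the ph object from the question, the data frame would be built from ph$mids and ph$density instead, since those are the values plot(ph, freq=FALSE) draws.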
Do you want a density function?
x = rnorm(1000)
hist(x, breaks = 30, freq = FALSE)
lines(density(x), col = "red")