I'm trying to calculate the cumulative distribution function of the skewed generalized error distribution, using the probability density function from Theodossiou (http://www.mfsociety.org/modules/modDashboard/uploadFiles/journals/MJ~0~p1a4fjq38m1k2p45t6481fob7rp4.pdf).
In R the density looks like this:
psi <- -0.09547862
m <- 0.1811856
g <- -0.1288893
d <- 0.8029088
c <- (2/(1+exp(-g)))-1
p <- exp(psi)
y <- function(x) ((d**(1-(1/d)))/(2*p))*gamma(1/d)**(-1)*exp(-(1/d)*((abs(x-m)**d)/((1+sign(x-m)*c)**(d)*p**(d))))
The whole reason I am doing this is to fit the skewed generalized error distribution to my data and assess the fit with a QQ plot. So I need to calculate the cumulative distribution function and then the inverse CDF. For the inverse CDF I plan to use the inversion function from the GoFKernel package, but for this I need the CDF. Is there any way to calculate it with numerical integration in R?
To get a cumulative function via integration, you can pass the x values to a function that integrates from a suitably extreme low value up to an upper limit of x:
# Grid of x values to evaluate on
x <- seq(-10, 10, 0.1)
# First look at the density function
plot( y(x) ~ x )
# Integrate the density from -10 up to each x
cum <- sapply(x, function(x) integrate(y,-10, x)$value )
plot( cum ~ x)
# So the inverse is just `x` as a function of `cum`
plot( x ~ cum)
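Since the question also asks for the inverse CDF, here is a minimal sketch of one way to get it without an extra package: wrap integrate in a CDF function and solve cdf(q) = prob with uniroot. The helper names sged_cdf and sged_quantile and the search interval of -50 to 50 are assumptions made for illustration, not part of the original post. The integral is split at m because the density has a kink there.

```r
# Parameters and density copied from the question
psi <- -0.09547862
m <- 0.1811856
g <- -0.1288893
d <- 0.8029088
c <- (2/(1+exp(-g)))-1
p <- exp(psi)
y <- function(x) ((d^(1-(1/d)))/(2*p))*gamma(1/d)^(-1)*
  exp(-(1/d)*((abs(x-m)^d)/((1+sign(x-m)*c)^d*p^d)))

# CDF by numerical integration (hypothetical helper name).
# The integral is split at m, where the density has a kink.
sged_cdf <- function(q) {
  left <- integrate(y, -Inf, min(q, m))$value
  right <- if (q > m) integrate(y, m, q)$value else 0
  left + right
}

# Inverse CDF by root finding; the interval c(-50, 50) is an assumed
# search range that should be widened for more extreme probabilities.
sged_quantile <- function(prob) {
  uniroot(function(q) sged_cdf(q) - prob, interval = c(-50, 50),
          tol = 1e-8)$root
}

probs <- c(0.1, 0.5, 0.9)
q <- sapply(probs, sged_quantile)
round(sapply(q, sged_cdf), 4)  # should be close to probs
```

For a QQ plot, the sorted data could then be plotted against sged_quantile evaluated at ppoints(length(data)).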
In general, if you want to estimate the cumulative distribution function, use the function ecdf as follows:
x <- seq(-10,10,0.1)
Fn <- ecdf(y(x)) # note: this is the ECDF of the density values y(x), not of draws from the distribution
plot(Fn)
If you want to visualize how two data sets are similar, use qqplot as follows:
y1 <- y(x) # from your function
y2 <- rnorm(100) # some generic data
qqplot(y1, y2) # if the two data sets are from the same
# distribution, you should see a straight line
I would like to use the plotmo function from the plotmo package to plot an arima object. I estimate the arima model with a matrix of explanatory variables X (transfer function):
arima.model<-arima(y,c(3,1,3),xreg=X)
When plotting this object I get the following error:
plotmo(arima.model)
stats::predict(Arima.object, data.frame[3,1], type="response")
Error in predict.Arima(list(coef = c(0, 0, 0.426819838403672, -0.23337107002535, : 'xreg' and 'newxreg' have different numbers of columns
How can I fix this problem?
plotmo isn't really meant for time-series models like arima models and doesn't support them.
However, if you just want to plot the fitted model and some future values, the following function will do it (there may be a simpler way using the ts.plot function):
plarima <- function(ts, ..., n.ahead=1, main=deparse(substitute(ts)))
{
    model <- arima(ts, ...)
    if(!inherits(model, "Arima"))
        stop("this function requires 'arima' from the standard stats package")
    # calculations so we can extend the x axis
    n <- length(ts)
    x <- xy.coords(ts)$x
    if(any(is.na(x)))
        stop("NA in time")
    xdelta <- (x[n] - x[1]) / n
    # residuals = data - fitted, so fitted = data - residuals
    plot(ts - model$residuals, # plot the fit in gray
         xlim=c(x[1], x[n] + xdelta * n.ahead),
         main=main, col="gray", lwd=3)
    lines(ts) # plot the data
    # predict n.ahead values and plot them in red
    forecast <- predict(model, n.ahead=n.ahead)
    lines(x=x[n] + xdelta * (0:n.ahead), y=c(ts[n], forecast$pred), col=2)
    legend("topleft", legend=c("data", "fitted", "forecast"),
           col=c(1, "gray", 2), lwd=c(1,3,1), lty=1, bg="white")
    model # return the arima model
}
For example
plarima(lh, order=c(3,0,0), n.ahead=10)
plarima(USAccDeaths, order=c(0,1,1), seasonal=list(order=c(0,1,1)), n.ahead=10)
gives the following plots
(I'm assuming you are using the arima function from the standard stats package. The forecast package also has its own Arima function.)
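On the error message in the question itself: predict for an arima fit that used xreg expects a newxreg argument with the same number of columns as the original regressor matrix. A minimal sketch on simulated data follows; the data, the column names x1 and x2, and the forecast horizon h are made up for illustration.

```r
set.seed(1)
n <- 100
# Simulated regressors and response (made-up data for illustration)
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- as.numeric(arima.sim(list(ar = 0.5), n = n)) + 2 * X[, "x1"] - X[, "x2"]

fit <- arima(y, order = c(1, 0, 0), xreg = X)

# Future regressor values: one row per forecast step,
# same number of columns as X
h <- 5
newX <- cbind(x1 = rnorm(h), x2 = rnorm(h))
fc <- predict(fit, n.ahead = h, newxreg = newX)
fc$pred
```

In practice the future values of the regressors have to be known or forecast separately before they can be passed as newxreg.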
I was trying to predict future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the future values are plotted, the resulting y values are completely different from what was expected. Instead of a wave pattern, I get a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
Following is the sample R code used to get the results:
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x,dfram$y,type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique for predicting the unknown future y values, or should I be using other techniques like forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I got the phase and period of the signal. There are better ways of calculating them automatically.
# Phase and period
fase = 1
per = 10
In the linear model function I've put the triangular signal equations.
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
+ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2))))
,data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x=c(dfram$x,newdata$x),
y=c(p_olddata,p_newdata),
col=c(rep("blue",length(p_olddata)),rep("green",length(p_newdata))),
xlab="x",
ylab="y")
lines(dfram)
The black line is the original signal, the blue circles are the predictions for the original points, and the green circles are the predictions for the new data.
The graph shows a perfect fit for the model because there's no noise in the data. A real dataset will usually contain noise, so the fit will not look as nice as that.
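The same idea can be written more compactly with a small triangle-wave helper used as a single regressor. This is only a sketch: tri_wave is a hypothetical name, and its defaults phase = 0 and period = 10 are read off the sample data, which differs from the fase = 1 parametrisation used above.

```r
dfram <- data.frame(x = 1:16,
                    y = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))

# Hypothetical helper: triangle wave that rises from 0 to period/2
# and then falls back to 0 over one period
tri_wave <- function(x, phase = 0, period = 10) {
  pos <- (x - phase) %% period
  ifelse(pos <= period / 2, pos, period - pos)
}

# Regress y on the triangle wave and predict the future x values
fit_tri <- lm(y ~ tri_wave(x), data = dfram)
newdata <- data.frame(x = 17:22)
p_tri <- predict(fit_tri, newdata)
round(p_tri, 6)
```

With noisy data the phase and period would have to be estimated rather than read off a plot.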
Plotting data.
The x values are ages 16-23, and for each age I have 5 different values between 1 and 10.
I want a scatterplot with all five values plotted for each age, then a regression line, and then to calculate the correlation.
x <- (16:23)
y<- c(10,8,9,9,8,7,8,6,9,6,6,7,7,8,5,5,8,4,7,6,8,7,6,8,4,6,5,7,5,3,5,1,3,4,2,4,1,2,5)
Studie <- plot(cbind(x, y))
It just gives random points for y. I don't know how to get the plot.
If I understand your problem correctly, I would do it like this:
x <- rep(16:23, each=5)
y<- c(10,8,9,9,8,7,8,6,9,6,6,7,7,8,5,5,8,4,7,6,8,7,6,8,4,6,5,7,5,3,5,1,3,4,2,4,1,2,5, 2)
plot(x,y,col=x-15)
reg <- lm(y~x)
summary(reg)
lines(x, reg$fitted.values)
Be careful: a value is missing in vector y, so I've added 2 at the end.
The regression line is:
y = -0.7929 x + 21.2357
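The question also asks for the correlation, which the code above does not compute. A minimal sketch using the same 40 values (including the 2 appended at the end, which remains an assumption about the missing data point):

```r
x <- rep(16:23, each = 5)
y <- c(10,8,9,9,8,7,8,6,9,6,6,7,7,8,5,5,8,4,7,6,
       8,7,6,8,4,6,5,7,5,3,5,1,3,4,2,4,1,2,5,2)

# Pearson correlation between age and value
r <- cor(x, y)
r

# Test of the correlation, with a confidence interval
cor.test(x, y)

# For a simple linear regression, r^2 equals the model's R-squared
reg <- lm(y ~ x)
all.equal(r^2, summary(reg)$r.squared)
```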
I need to colour data points that are outside of the confidence bands on the plot below differently from those within the bands. Should I add a separate column to my dataset to record whether the data points are within the confidence bands? Can you provide an example please?
Example dataset:
## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Take a look at the data
plot(
diseasesev~temperature,
data=severity,
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
par(new=TRUE) # don't start a new plot
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
lm(diseasesev~temperature),
xRange,
level=0.95,
interval="confidence"
)
## Plot lines derived from best fit line and confidence band datapoints
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
xlim=c(0,30),
ylim=c(0,30),
xlab="",
ylab=""
)
Well, I thought that this would be pretty easy with ggplot2, but now I realize that I have no idea how the confidence limits for stat_smooth/geom_smooth are calculated.
Consider the following:
library(ggplot2)
pred <- as.data.frame(predict(severity.lm,level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature,
in_interval = diseasesev <=pred$upr & diseasesev >=pred$lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') + geom_point(aes(colour=in_interval)) +
geom_line(aes(y=lwr),colour=I('red')) + geom_line(aes(y=upr),colour=I('red'))
This produces the following plot (image: http://ifellows.ucsd.edu/pmwiki/uploads/Main/strangeplot.jpg).
I don't understand why the confidence band calculated by stat_smooth is inconsistent with the band calculated directly from predict (i.e. the red lines). Can anyone shed some light on this?
Edit:
Figured it out. ggplot2 uses 1.96 * standard error to draw the intervals for all smoothing methods, whereas predict uses the t quantile for the residual degrees of freedom.
pred <- as.data.frame(predict(severity.lm,se.fit=TRUE,
level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature,
in_interval = diseasesev <=pred$fit.upr & diseasesev >=pred$fit.lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') +
geom_point(aes(colour=in_interval)) +
geom_line(aes(y=fit.lwr),colour=I('red')) +
geom_line(aes(y=fit.upr),colour=I('red')) +
geom_line(aes(y=fit.fit-1.96*se.fit),colour=I('green')) +
geom_line(aes(y=fit.fit+1.96*se.fit),colour=I('green'))
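The gap between the red and green lines can be checked numerically without ggplot2. The sketch below only assumes that the green band is fit ± 1.96 * se.fit, as stated in the edit; it shows that the band returned by predict corresponds to the t quantile with the model's residual degrees of freedom. The variable names t_crit, z_crit, upr_t and upr_z are introduced here for illustration.

```r
diseasesev <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity.lm <- lm(diseasesev ~ temperature)

pr <- predict(severity.lm, se.fit = TRUE, level = 0.95,
              interval = "confidence")

t_crit <- qt(0.975, df = pr$df)   # multiplier used by predict
z_crit <- qnorm(0.975)            # approximately 1.96

upr_t <- pr$fit[, "fit"] + t_crit * pr$se.fit
upr_z <- pr$fit[, "fit"] + z_crit * pr$se.fit

c(t = t_crit, z = z_crit)
# Difference between the hand-computed t band and predict's upper limit
max(abs(upr_t - pr$fit[, "upr"]))
```

With only 10 observations the t multiplier is noticeably larger than 1.96, which is why the two bands differ visibly.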
The easiest way is probably to calculate a vector of TRUE/FALSE values that indicate whether a data point is inside the confidence interval. I'm going to reshuffle your example a little so that all of the calculations are completed before the plotting commands are executed; this provides a clean separation in the program logic that could be exploited if you were to package some of this into a function.
The first part is pretty much the same, except I replaced the additional call to lm() inside predict() with the severity.lm variable; there is no need to use additional computing resources to recalculate the linear model when we already have it stored:
## Dataset from
# apsnet.org/education/advancedplantpath/topics/
# RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
severity.lm,
xRange,
level=0.95,
interval="confidence"
)
Now, we'll calculate the confidence intervals for the original data points and run a test to see if the points are inside the interval:
modelConfInt <- predict(
severity.lm,
level = 0.95,
interval = "confidence"
)
insideInterval <- modelConfInt[,'lwr'] < severity[['diseasesev']] &
severity[['diseasesev']] < modelConfInt[,'upr']
Then we'll do the plot: first the high-level plotting function plot(), as you used it in your example, but we will only plot the points inside the interval. We will then follow up with the low-level function points(), which will plot all the points outside the interval in a different color. Finally, matplot() will be used to fill in the confidence intervals as you used it. However, instead of calling par(new=TRUE) I prefer to pass the argument add=TRUE to high-level functions to make them act like low-level functions.
Using par(new=TRUE) is like playing a dirty trick on a plotting function, which can have unforeseen consequences. The add argument is provided by many functions to cause them to add information to a plot rather than redraw it. I would recommend exploiting this argument whenever possible and falling back on par() manipulations as a last resort.
# Take a look at the data- those points inside the interval
plot(
diseasesev~temperature,
data=severity[ insideInterval,],
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
# Add points outside the interval, color differently
points(
diseasesev~temperature,
pch = 16,
col = 'red',
data = severity[ !insideInterval,]
)
# Add regression line and confidence intervals
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
add = TRUE
)
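To answer the "separate column" part of the question directly: the flag can also be stored in the data frame and turned into a colour vector, so that a single plot() call draws both groups. A minimal sketch, where the column name inside is an assumption introduced for illustration:

```r
diseasesev <- c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
temperature <- c(2,1,5,5,20,20,23,10,30,25)
severity <- data.frame(diseasesev, temperature)
severity.lm <- lm(diseasesev ~ temperature, data = severity)

# Confidence limits at the observed temperatures
ci <- predict(severity.lm, level = 0.95, interval = "confidence")

# Hypothetical column recording whether each point lies within the band
severity$inside <- ci[, "lwr"] < severity$diseasesev &
  severity$diseasesev < ci[, "upr"]

# One plot call, coloured by the flag
plot(diseasesev ~ temperature, data = severity, pch = 16,
     col = ifelse(severity$inside, "black", "red"),
     xlab = "Temperature", ylab = "% Disease Severity")
abline(severity.lm)
```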
I liked the idea and tried to make a function for that. Of course it's far from perfect. Your comments are welcome.
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Function to plot the linear regression and overlay the confidence intervals
ci.lines<-function(model,conf= .95 ,interval = "confidence"){
x <- model$model[[2]] # predictor, from the stored model frame
y <- model$model[[1]] # response
xm<-mean(x)
n<-length(x)
ssx<- sum((x - mean(x))^2)
s.t<- qt(1-(1-conf)/2,(n-2))
xv<-seq(min(x),max(x),(max(x) - min(x))/100)
yv<- coef(model)[1]+coef(model)[2]*xv
sigma <- summary(model)$sigma # residual standard error
se <- switch(interval,
confidence = sigma * sqrt(1/n+(xv-xm)^2/ssx),
prediction = sigma * sqrt(1+1/n+(xv-xm)^2/ssx)
)
ci<-s.t*se
uyv<-yv+ci
lyv<-yv-ci
limits1 <- min(c(x,y))
limits2 <- max(c(x,y))
predictions <- predict(model, level = conf, interval = interval)
insideCI <- predictions[,'lwr'] < y & y < predictions[,'upr']
x_name <- rownames(attr(model$terms,"factors"))[2]
y_name <- rownames(attr(model$terms,"factors"))[1]
plot(x[insideCI],y[insideCI],
pch=16,pty="s",xlim=c(limits1,limits2),ylim=c(limits1,limits2),
xlab=x_name,
ylab=y_name,
main=paste("Graph of ", y_name, " vs ", x_name,sep=""))
abline(model)
points(x[!insideCI],y[!insideCI], pch = 16, col = 'red')
lines(xv,uyv,lty=2,col=3)
lines(xv,lyv,lty=2,col=3)
}
Use it like this:
ci.lines(severity.lm, conf= .95 , interval = "confidence")
ci.lines(severity.lm, conf= .85 , interval = "prediction")