Extrapolation curve for an R binomial model

Here is the code I used:
data<-read.table("YB.txt",header=T)
attach(data)
fit2<-glm(cbind(success,fail)~time*col,data=data,family=binomial)
summary(fit2)
predict.data<-as.data.frame(predict(fit2,newdata=temp.data,type="link",se=TRUE))
new.data<-cbind(temp.data,predict.data)
std<-qnorm(0.95/2+0.5)
new.data$ymin<-fit2$family$linkinv(new.data$fit-std*new.data$se)
new.data$ymax<-fit2$family$linkinv(new.data$fit+std*new.data$se)
new.data$fit<-fit2$family$linkinv(new.data$fit)
op<-cbind(success/(Neggs))
p<-ggplot(data,aes(x=time,y=op,fill=col,color=col))+geom_point()
p + geom_ribbon(data=new.data, aes(y=fit, ymin=ymin, ymax=ymax), alpha=0.1, linetype="dashed") +
  geom_line(data=new.data, aes(y=fit), linetype="solid") +
  labs(x="patatou", y="patata", title="patati") +
  theme_calc() +
  scale_color_manual(values=c("#CC6666", "#9999CC")) +
  labs(colour="Eggs color", linetype="Eggs color", shape="Eggs color")
=> I get two nice prediction curves. However, my collected data start at 5 days and end at 13 days. I would like to extend the curves to 0-5 days and beyond 13 days (i.e., to 20 days), to see what the model predicts there. So I tried this:
NewData<-as.matrix(cbind(time,col))
colnames<-(NewData)
colnames(NewData)<-c("time","col")
predict(fit2,NewData,se.fit=TRUE,scale=NULL,df=Inf,interval=c("none","confidence","prediction"),level=0.95)
That didn't work... Does somebody have an idea how to solve this?

predict() uses a fitted model to provide you with the values of the y-variable that correspond to the values of the x-variables in the newdata argument. So if you only provide x-variable values that range from 5 to 13, you will only get the corresponding predicted y-variable values. In order to "extend" the prediction line, you need to supply x-variable values over the whole range that you want to plot, e.g., 0 to 20. You will want something like:
x_coords <- seq(from=0, to=20, by=0.1)
y_coords <- predict(fitted_model, newdata=data.frame(x=x_coords))
plot(x, y, xlim=c(0,20))
points(x=x_coords, y=y_coords, type="l")
My answer here (link) provides a worked example using the Auto dataset from the ISLR package.
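Applied to the model in the question, a minimal sketch might look like the following (my own illustration, assuming fit2 and data from above and that col is a factor with the two egg-colour levels; temp.data from the original code just needs to be rebuilt over the wider time range):
## hypothetical prediction grid covering 0-20 days for both colour levels
temp.data <- expand.grid(time = seq(0, 20, by = 0.1),
                         col  = levels(factor(data$col)))
## predict on the link scale with standard errors, then back-transform
pred <- predict(fit2, newdata = temp.data, type = "link", se.fit = TRUE)
new.data <- cbind(temp.data, fit = pred$fit, se = pred$se.fit)
std <- qnorm(0.975)
new.data$ymin <- fit2$family$linkinv(new.data$fit - std * new.data$se)
new.data$ymax <- fit2$family$linkinv(new.data$fit + std * new.data$se)
new.data$fit  <- fit2$family$linkinv(new.data$fit)
## the ggplot code from the question can then be reused unchanged with this new.data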

Related

Messy graph when plotting fitted values from flexmix

I am trying to plot 3 regression lines for the 3 components in the data estimated via the flexmix package.
However, when I try to plot the predicted values for the first component, the result is a messy graph with lines connecting to each other.
This is my code:
m_1 <- flexmix(x ~ y + z, data=set2, cluster=clstr)
yhat <-fitted(m_1)
plot(x, y, options=...)
lines(x, yhat[,1], options=...)
Online I found some hints about using order(), but with no result:
reorder <- order(yhat[,1])
lines(x[reorder], yhat[,1][reorder], options=...)
It results in a continuous line that looks like a time series with high volatility.
The other two components are working fine. Any idea on how to solve this?
The solution is here, I think:
http://pages.mtu.edu/~shanem/psy5220/daily/Day19/Mixture_of_regressions.html
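For what it is worth, the usual cause of those zig-zagging lines is that lines() connects points in the order they appear in the data, so the ordering needs to be done on the variable plotted on the horizontal axis, not on the fitted values. A minimal sketch under that assumption, reusing the names from the question (x standing for whatever is on the horizontal axis):
ord <- order(x)
plot(x, y)
lines(x[ord], yhat[ord, 1], col = "red")   # reorder both x and the component-1 fitted values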

Quantile regression split plots

I have a few years of daily rainfall data for a particular region. To get insight into extreme rainfall events, I used quantile regression (the quantreg package) in R. The plot for all days is shown below. What I want is to split the regression line in the middle (or at some other point) and fit the first and second halves of the data separately to see the difference.
Here is how I used quantreg:
plot(data$ahmAnn~data$Days, type="p", pch=20,cex=.4, col="gray50",
xlab="Days", ylab="Rainfall")
qr <- abline(rq(data$ahmAnn~data$Days,tau=.99),col="red")
If you want to run quantreg::rq on different sets of your data, replace
data$ahmAnn~data$Days
with
x <- 10
stopifnot(x <= nrow(data))
set1 <- data[1:x,]
abline(rq(set1$ahmAnn~set1$Days,tau=.99),col="red")
set2 <- data[x:nrow(data),]
abline(rq(set2$ahmAnn~set2$Days,tau=.99),col="red")
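Putting it together, a sketch of the full plot might look like this (my own illustration; I am assuming data has the columns Days and ahmAnn, is sorted by Days, and is split at the halfway point):
library(quantreg)
cut <- floor(nrow(data) / 2)                 # split point; use whichever day you prefer
set1 <- data[1:cut, ]
set2 <- data[(cut + 1):nrow(data), ]
plot(data$ahmAnn ~ data$Days, type = "p", pch = 20, cex = .4, col = "gray50",
     xlab = "Days", ylab = "Rainfall")
## 99th-percentile regression fitted separately on each half,
## each line drawn only over that half's range of Days
qr1 <- rq(ahmAnn ~ Days, tau = .99, data = set1)
qr2 <- rq(ahmAnn ~ Days, tau = .99, data = set2)
lines(set1$Days, fitted(qr1), col = "red")
lines(set2$Days, fitted(qr2), col = "blue")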

Convert double differenced forecast into actual value diff() in R

I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, none of these gives a clear answer on how to convert a forecast from an ARIMA model fitted to a series that was differenced (with diff()) to achieve stationarity back to the original scale.
Code sample:
## read data and start from 1 jan 2014
dat<-read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual,start=c(2014,1,1),freq=365)
##Check how we can get stationary series
plot((diff(val.ts)))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(log(diff(val.ts)))
plot(sqrt(val.ts))
plot(sqrt(diff(val.ts)))
## I found that double differencing, i.e. diff(diff(val.ts)), gives a stationary series.
# I ran the code below to get the three ARIMA order parameters from auto.arima
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE, xreg=diff(diff(xreg)))
# Finally ran ARIMA
fit <- Arima(diff(diff(val.ts)),order=c(5,0,2),xreg = diff(diff(xreg)))
# plot the original to see the fit
plot(diff(diff(val.ts)),col="orange")
#plot fitted
lines(fitted(fit),col="blue")
This gives me a perfectly fitting time series. However, how do I convert the fitted values back into their original metric from the double-differenced scale they are currently on? For log I know we can invert with the corresponding exponential (exp() or 10^, depending on the base), and for square root there is a similar inverse, but what do I do for differencing, let alone double differencing?
Any help on this in R, please? After days of rigorous work, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit from the auto.arima function and found that it does. So auto.arima can't handle a non-stationary series on its own, and it requires some effort on the part of the analyst to make the series stationary.
Firstly, auto.arima without any differencing. Orange is the actual value, blue is the fitted value.
ARIMAfit <- auto.arima(val.ts, approximation=FALSE,trace=FALSE, xreg=xreg)
plot(val.ts,col="orange")
lines(fitted(ARIMAfit),col="blue")
Secondly, I tried differencing:
ARIMAfit <- auto.arima(diff(val.ts), approximation=FALSE,trace=FALSE, xreg=diff(xreg))
plot(diff(val.ts),col="orange")
lines(fitted(ARIMAfit),col="blue")
Thirdly, I differenced twice:
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation=FALSE,trace=FALSE,
xreg=diff(diff(xreg)))
plot(diff(diff(val.ts)),col="orange")
lines(fitted(ARIMAfit),col="blue")
A visual inspection suggests that the third graph is the most accurate of the three. This I am aware of. The challenge is how to convert the fitted values, which are on the double-differenced scale, back into the actual metric!
The opposite of diff is essentially cumsum, but you need to know the starting value at each level of differencing. For example:
set.seed(1234)
x <- runif(100)
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
[1] TRUE
Share some of your data to make a reproducible example to better help answer the question.
If you expect that differencing will be necessary to obtain stationarity, then why not simply include the maximum differencing order in the function call? That is, the "I" in ARIMA is the order of differencing prior to fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you specify the order of differencing with the d argument (or D for seasonal differencing). Since double differencing made your series stationary, you want something like:
fit <- auto.arima(val.ts, d=2, ...)
From this, you can verify that the fitted values do indeed map back onto the original data:
plot(val.ts)
lines(fitted(fit), col="blue")
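A short sketch of how that looks end to end (assuming val.ts from the question; I have left out xreg for brevity):
library(forecast)
fit <- auto.arima(val.ts, d = 2)   # the two differences are handled inside the model
fc  <- forecast(fit, h = 30)       # forecasts come back on the original scale
plot(fc)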
In the example below containing dummy data, I have double differenced: first I removed seasonality (lag = 12) and then I removed the trend from the differenced data (lag = 1).
library(magrittr)   # for the %>% pipe
set.seed(1234)
x <- rep(NA, 24)
x <- x %>%
  rnorm(mean = 10, sd = 5) %>%
  round(., 0) %>%
  abs()
yy <- diff(x, lag = 12)
z <- diff(yy, lag = 1)
Using the script that @jeremycg included above, and which I include below, how would I undo the double difference? Would I need to add lag specifiers to the two nested diff() commands? If so, which diff() would take the lag = 12 specifier and which the lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
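One way to make the lags explicit when undoing the differences (a sketch of my own, not from the answers above) is stats::diffinv, reversing the differences in the opposite order to how they were taken and seeding each step with the appropriate starting values:
## undo the lag-1 (trend) difference first, seeded with the first value of yy,
## then undo the lag-12 (seasonal) difference, seeded with the first 12 values of x
yy.back <- diffinv(z, lag = 1, xi = yy[1])
x.back  <- diffinv(yy.back, lag = 12, xi = x[1:12])
all.equal(x.back, x)   # should be TRUE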

Density plot in R sometimes gives frequency, other times probabilities?

Plotting the density of some of my data yields frequencies on the Y axis, while plotting the density of other data yields probabilities(?) on the Y axis. Is there an equivalent of freq=FALSE for density() like there is for hist() so I can have control over this? I've tried searching around for this specific issue, but I almost always end up getting hist() documentation instead of finding the answer to this specific question. Thank you!
Adding such a parameter to density would be statistically unwise for the reasons articulated by @MrFlick. If you want to convert a density estimate to be on the same scale as the observations, you can multiply it by the length of the vector used for the density calculation. The density then becomes a "per x unit" estimate of "frequency". Compare the two plots:
set.seed(123); x <- sample(1:10, size=5)
#> x
#[1] 3 8 4 7 6
d <- density(x)
plot(d)
plot(d$x, length(x)*d$y, type="l")
The "per unit of x" estimate is now in the correct (approximate) range of 0.5 (and it's integral should be roughly equal to the counts). It's only accidentally that an x value of a density would ever be similar to a probability. It should always be that the integral of the density is unity.
Perhaps you are looking for the ecdf function? Instead of returning a density , it provides a mechanism for constructing a cumulative probability function.
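As an aside, if the goal is simply to overlay a density curve on a frequency histogram, the usual scaling is by n times the bin width (a small sketch of my own, not part of the answer above):
set.seed(123)
y <- rnorm(200)
h <- hist(y, freq = TRUE)                    # y-axis shows counts
d <- density(y)
lines(d$x, d$y * length(y) * diff(h$breaks)[1], col = "blue")   # density rescaled to counts per bin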

Add normal curve and horizontal box-plot to already tabulated survey data

I have some already-tabulated survey data imported into a data frame and can make bar charts from it with ggplot.
X  X.1               X.2
3  Less than 1 year    7
4  1-5 years          45
5  6-10 years         84
6  11-15 years       104
7  16 or more years  249
ggplot(responses[3:7,], aes(x=factor(X), y=X.2)) + geom_bar(stat="identity")
I would like to overlay a normal curve on the bar chart, and a horizontal box-and-whisker plot below that, but I am unsure of the correct way to do this without the individual observations; it should be possible... I think. The example output I am trying to emulate is here: http://t.co/yOqRmOj5
I look forward to learning a new trick for this if there is one, or hearing whether anyone else has encountered it.
To save anyone else having to download the 134 page PDF, here is an example of the graph referenced in the question.
In this example, the data is from a Likert scale, and so the original data can be extrapolated and a normal curve and boxplot is at least interpretable. However, there are plots where the horizontal scale is nominal. Normal curves make no sense in these cases.
Your question is about an ordinal scale. Just from this summarized data, it is not reasonable to try and make a normal curve. You could treat each entry as located at the center point of its range (0.5 years, 3 years, 8 years, etc.), but there is no way to reasonably assign a value for the highest group (and worse, it is your largest, so its contribution is not insignificant). You must have the original data to make any reasonable approximation.
If you just want a density estimation based on the data that you have, then the oldlogspline function in the logspline package can fit density estimates to interval censored data:
mymat <- cbind( c(0,1,5.5,10.5, 15.5), c(1,5.5,10.5, 15.5, Inf) )[rep(1:5, c(7,45,84,104,249)),]
library(logspline)
fit <- oldlogspline(interval=mymat[mymat[,2] < 100,],
                    right=mymat[mymat[,2] > 100, 1], lbound=0)
fit2 <- oldlogspline.to.logspline(fit)
hist( mymat[,1]+0.5, breaks=c(0,1,5.5,10.5,15.5,60), main='', xlab='Years')
plot(fit2, add=TRUE, col='blue')
If you want a normal distribution, then the survreg function in the survival package will fit interval censored data:
library(survival)
mymat2 <- mymat
mymat2[ mymat2>100 ] <- NA
fit3 <- survreg(Surv(mymat2[,1], mymat2[,2], type='interval2') ~ 1,
                dist='gaussian', control=survreg.control(maxiter=100))
curve( dnorm(x, coef(fit3), fit3$scale), from=0, to=60, col='green', add=TRUE)
Though a different distribution may fit better:
fit4 <- survreg(Surv(mymat2[,1]+.01, mymat2[,2], type='interval2') ~ 1,
                dist='weibull', control=survreg.control(maxiter=100))
curve(dweibull(x, scale=exp(coef(fit4)), shape=1/fit4$scale),
      from=0, to=60, col='red', add=TRUE)
You could also fit a discrete distribution using fitdistr in MASS:
library(MASS)
tmpfun <- function(x, size, prob) {
  ifelse(x==0, dnbinom(0,size,prob),
  ifelse(x < 5, pnbinom(5,size,prob)-pnbinom(0,size,prob),
  ifelse(x < 10, pnbinom(10,size,prob)-pnbinom(5,size,prob),
  ifelse(x < 15, pnbinom(15,size,prob)-pnbinom(10,size,prob),
         pnbinom(15,size,prob, lower.tail=FALSE)))))
}
fit5 <- fitdistr(mymat[,1], tmpfun, start=list(size=6, prob=0.28))
lines(0:60, dnbinom(0:60, fit5$estimate[1], fit5$estimate[2]),
      type='h', col='orange')
If you wanted something a little fuzzier, such that 5.5 years could have been reported as either 5 or 6 years, and missing or "don't know" responses could be used to some degree (with some assumptions), then the EM algorithm could be used to estimate the parameters (but this is a lot more complicated, and you would need to specify your assumptions about how the actual values translate into observed values).
There might be a better way to look at that data. Since it is constrained by design to be integer valued, perhaps fitting a Poisson or Negative Binomial distribution might be more sensible. I think you should ponder the fact that the X values in the data you present are somewhat arbitrary. There appears to be no good reason to think that 3 is the most appropriate value for the lowest category. Why not 1?
And then, of course, you need to explain what that data refers to. It does not look to be at all Normal or even Poisson distributed. It is very left skewed, and there are not a lot of left-skewed distributions in common usage (despite there being an infinite number of possible such distributions).
If you just wanted to demonstrate how non-Normal this data is, even ignoring the fact that you are fitting a truncated version of a Normal distribution, then take a look at this exercise in plotting:
barp <- barplot(dat$X.2)
barp
# this is what barplot returns and is then used as the x-values for a call to lines:
#      [,1]
# [1,]  0.7
# [2,]  1.9
# [3,]  3.1
# [4,]  4.3
# [5,]  5.5
lines(barp, 1000*dnorm(seq(3,7), 7, 2))
