Quantile regression split plots in R

I have a few years of daily rainfall data for a particular region. To get insight into extreme rainfall events, I used quantile regression (the quantreg package) in R. The plot for the full period is shown below. What I want is to split the regression line in the middle (or at some other point) and fit the first and second halves of the data separately to see the difference.
Here is how I used quantreg:
library(quantreg)
plot(data$ahmAnn ~ data$Days, type = "p", pch = 20, cex = .4, col = "gray50",
     xlab = "Days", ylab = "Rainfall")
abline(rq(data$ahmAnn ~ data$Days, tau = .99), col = "red")

If you want to run quantreg::rq on different subsets of your data, replace the single fit of
data$ahmAnn ~ data$Days
with something like:
x <- 10                        # row at which to split the data
stopifnot(x <= nrow(data))
set1 <- data[1:x, ]
abline(rq(set1$ahmAnn ~ set1$Days, tau = .99), col = "red")
set2 <- data[x:nrow(data), ]
abline(rq(set2$ahmAnn ~ set2$Days, tau = .99), col = "red")
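Note that abline() draws each fitted line across the entire plot. If you want each fit drawn only over its own half of the x-range, a minimal sketch (assuming the same data as above, with Days in increasing order, and splitting at the middle row) is:
x <- floor(nrow(data) / 2)                 # split point: the middle of the series
set1 <- data[1:x, ]
set2 <- data[x:nrow(data), ]
fit1 <- rq(ahmAnn ~ Days, tau = .99, data = set1)
fit2 <- rq(ahmAnn ~ Days, tau = .99, data = set2)
# draw each fitted line only over the days it was fitted to
lines(set1$Days, predict(fit1), col = "red", lwd = 2)
lines(set2$Days, predict(fit2), col = "blue", lwd = 2)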


Extrapolation curve R binomial model

Here is the code I used:
library(ggplot2)
library(ggthemes)   # for theme_calc()
data <- read.table("YB.txt", header = TRUE)
attach(data)
fit2 <- glm(cbind(success, fail) ~ time * col, data = data, family = binomial)
summary(fit2)
# temp.data: the prediction grid (defined elsewhere, not shown)
predict.data <- as.data.frame(predict(fit2, newdata = temp.data, type = "link", se.fit = TRUE))
new.data <- cbind(temp.data, predict.data)
std <- qnorm(0.95/2 + 0.5)   # 1.96, for a 95% confidence band
new.data$ymin <- fit2$family$linkinv(new.data$fit - std * new.data$se.fit)
new.data$ymax <- fit2$family$linkinv(new.data$fit + std * new.data$se.fit)
new.data$fit <- fit2$family$linkinv(new.data$fit)
op <- success / Neggs   # observed proportions (Neggs comes from the attached data)
p <- ggplot(data, aes(x = time, y = op, fill = col, color = col)) + geom_point()
p + geom_ribbon(data = new.data, aes(y = fit, ymin = ymin, ymax = ymax),
                alpha = 0.1, linetype = "dashed") +
  geom_line(data = new.data, aes(y = fit), linetype = "solid") +
  labs(x = "patatou", y = "patata", title = "patati") +
  theme_calc() +
  scale_color_manual(values = c("#CC6666", "#9999CC")) +
  labs(colour = "Eggs color", linetype = "Eggs color", shape = "Eggs color")
I got two beautiful prediction curves. However, my collected data start at 5 days and end at 13 days. I would like to extend the curves to 0-5 days and beyond 13 days (i.e., to 20 days), in order to have a prediction of what I should get. So I tried this:
NewData<-as.matrix(cbind(time,col))
colnames<-(NewData)
colnames(NewData)<-c("time","col")
predict(fit2,NewData,se.fit=TRUE,scale=NULL,df=Inf,interval=c("none","confidence","prediction"),level=0.95)
It didn't work. Does anybody have an idea how to solve this?
predict() uses a fitted model to provide you with the values of the y-variable that correspond to the values of the x-variables in the newdata argument. So if you only provide x-variable values that range from 5 to 13, you will only get the corresponding predicted y-variable values. In order to "extend" the prediction line, you need to supply x-variable values over the whole range that you want to plot, e.g., 0 to 20. You will want something like:
x_coords <- seq(from = 0, to = 20, by = 0.1)
y_coords <- predict(fitted_model, newdata = data.frame(x = x_coords))
plot(x, y, xlim = c(0, 20))                 # x, y: the original observations
points(x = x_coords, y = y_coords, type = "l")
My answer here (link) provides a worked example using the Auto dataset from the ISLR package.
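Applied to the model in the question, a minimal sketch (assuming fit2 and the time and col variables from the code above) would be:
# prediction grid covering the full 0-20 day range for every egg colour
grid <- expand.grid(time = seq(0, 20, by = 0.1), col = unique(data$col))
pred <- predict(fit2, newdata = grid, type = "link", se.fit = TRUE)
grid$fit <- fit2$family$linkinv(pred$fit)
grid$ymin <- fit2$family$linkinv(pred$fit - 1.96 * pred$se.fit)
grid$ymax <- fit2$family$linkinv(pred$fit + 1.96 * pred$se.fit)
ggplot(grid, aes(x = time, y = fit, color = col)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax, fill = col), alpha = 0.1) +
  geom_line()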

Time Series Forecasting using Support Vector Machine (SVM) in R

I've tried searching but couldn't find a specific answer to this question. So far I have learned that time series forecasting is possible with an SVM. I've gone through a few papers/articles that do this, but they don't include any code, only an explanation of the algorithm (which I didn't quite understand), and some have done it in Python.
My problem is this: I have univariate company sales data from 2010 to 2017, and I need to forecast the sales value for 2018 using an SVM in R.
Would you be kind enough to simply present and explain the R code to perform the same using a small example?
I really do appreciate your inputs and efforts!
Thanks!!!
Let's assume you have monthly data, for example derived from the AirPassengers data set. You don't need time-series-class data, just a data frame containing time steps and values; let's name them x and y. Next you fit an SVM model and specify the time steps you need to forecast; the predict function then computes the forecast for those time steps. That's it. However, a support vector machine is not commonly regarded as the best method for time series forecasting, especially for long series. It can perform well a few observations ahead, but I wouldn't expect good results when forecasting, e.g., daily data for a whole year ahead (though it obviously depends on the data). Simple R code for an SVM-based forecast:
# prepare sample data as a data frame with columns of time steps (x) and values (y)
library(e1071)
data(AirPassengers)
monthly_data <- unclass(AirPassengers)
months <- 1:144
DF <- data.frame(x = months, y = monthly_data)
# train an svm model; consider further tuning the parameters for lower MSE
svmodel <- svm(y ~ x, data = DF, type = "eps-regression",
               kernel = "radial", cost = 10000, gamma = 10)
# specify time steps for the forecast, e.g. the whole series + 12 months ahead
nd <- 1:156
# compute the forecast for all 156 months
prognoza <- predict(svmodel, newdata = data.frame(x = nd))
# plot the results
ylim <- c(min(DF$y), max(DF$y))
xlim <- c(min(nd), max(nd))
plot(DF$y, col = "blue", ylim = ylim, xlim = xlim, type = "l")
par(new = TRUE)
plot(prognoza, col = "red", ylim = ylim, xlim = xlim)
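The cost and gamma values above are arbitrary; a quick grid search with e1071's tune.svm() can find better ones (a sketch; the search ranges are arbitrary too):
# cross-validated grid search over gamma and cost
tuned <- tune.svm(y ~ x, data = DF, gamma = 10^(-3:1), cost = 10^(0:4))
summary(tuned)
svmodel <- tuned$best.model   # use the best model found for prediction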

How can I plot an expected richness curve (Chao1) via vegan in R

I have a dataset from one site containing data on species and their abundance (the number of individuals of each species in the sample).
I use the vegan package for alpha diversity analysis.
For instance, I plot a species rarefaction curve via the rarecurve function (I can't use the specaccum function because I have data from only one site), and calculate the Chao1 index via the estimateR function.
How can I plot a Chao1 expected richness curve using the estimateR function? I would then like to combine these curves in one single plot.
library(vegan)
TR <- matrix(nrow=1,c(3,1,1,17,1,1,1,1,1,2,1,1,3,13,31,24,6,1,1,4,1,10,2,3,1,5,6,1,1,1,4,16,17,15,6,9,66,3,1,3,24,15,2,3,17,1,7,2,27,13,2,1,1,3,1,3,30,7,1,1,4,1,2,5,1,1,6,2,1,9,11,5,8,7,2,2,2,1,13,3,8,4,1,5,27,1,62,13,6,7,7,4,9,1,7,7,1,25,1,5,3,1,2,1,1,5,2,73,25,17,43,88,2,3,38,4,5,6,6,16,2,13,10,7,1,2,9,3,1,3,1,8,4,4,5,13,2,25,9,2,1,12,29,4,1,9,1,1,3,4,2,9,4,26,2,7,4,18,1,10,10,4,6,5,20,1,2,11,1,3,1,2,1,1,12,3,2,1,4,24,7,22,19,43,2,9,18,1,1,1,9,7,6,1,8,2,2,19,7,26,4,4,1,3,4,5,2,4,8,2,3,1,5,5,1,11,6,6,2,4,3,1,10,6,9,16,1,1,32,1,1,31,2,12,2,13,1,2,9,13,1,11,8,1,14,5,9,1,3,1,7,1,1,13,17,1,1,3,2,9,1,4,1,7,2,2,9,24,20,2,1,2,2,1,9,5,1,1,23,13,7,1,8,5,47,32,6,13,16,8,2,1,5,4,3,1,2,1,1,1,3,14,6,21,2,7,2,2,16,2,10,21,18,2,1,3,33,12,55,4,1,5,14,3,10,2,4,1,2,5,7,6,2,12,14,28,18,30,28,7,1,1,1,3,4,2,17,60,31,3,3,2,2,3,6,2,6,1,13,2,3,13,7,2,10,19,9,7,1,3))
num_species <- specnumber(TR)
chao1 <- estimateR(TR)[2,]     # row 2 of the estimateR output is S.chao1
shannon <- diversity(TR, "shannon")
rarecurve(TR)
estimateR(TR)
Here is a plot, built from EstimateS output (I fed it the same data) with SigmaPlot:
The thin line is the expected richness (Chao1). In R I can plot only the species accumulation curve.
In EstimateS I get values for all 2990 individuals, but not in R.
I don't know how things are done in EstimateS, but it looks like the expected richness (Chao1) curve is based on the mean over random subsamples of the community. This could be done like this:
subchao <- sapply(1:2990, function(i)
  mean(sapply(1:100, function(...) estimateR(rrarefy(TR, i))[2,])))
This randomly rarefies (rrarefy()) to every sample size from 1 to 2990 and takes the mean of 100 replicates at each size. This will take time.
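To overlay the two curves in a single plot, a sketch (assuming subchao from above):
rarecurve(TR, xlab = "Individuals", ylab = "Species")   # species accumulation (rarefaction) curve
lines(1:2990, subchao, col = "red")                     # thin line: expected richness (Chao1)
legend("bottomright", legend = c("Rarefaction", "Chao1"),
       col = c("black", "red"), lty = 1)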

Convert a double-differenced forecast (diff()) back into actual values in R

I have already read
Time Series Forecast: Convert differenced forecast back to before difference level
and
How to "undifference" a time series variable
Unfortunately, neither gives a clear answer on how to convert a forecast from an ARIMA model, fitted to a series that was differenced with diff() to reach stationarity, back to the original scale.
Code sample:
## read data, starting from 1 Jan 2014
library(forecast)
dat <- read.csv("rev forecast 2014-23 dec 2015.csv")
val.ts <- ts(dat$Actual, start = c(2014, 1, 1), freq = 365)
## check how we can get a stationary series
plot(diff(val.ts))
plot(diff(diff(val.ts)))
plot(log(val.ts))
plot(log(diff(val.ts)))
plot(sqrt(val.ts))
plot(sqrt(diff(val.ts)))
## I found that double differencing, i.e. diff(diff(val.ts)), gives a stationary series.
# run auto.arima to get the three ARIMA orders (xreg is defined elsewhere, not shown)
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation = FALSE, trace = FALSE,
                       xreg = diff(diff(xreg)))
# finally run ARIMA
fit <- Arima(diff(diff(val.ts)), order = c(5, 0, 2), xreg = diff(diff(xreg)))
# plot the original to see the fit
plot(diff(diff(val.ts)), col = "orange")
# plot the fitted values
lines(fitted(fit), col = "blue")
This gives me a perfect-looking fit. However, how do I convert the fitted values back into their original metric, i.e. from the double-differenced scale to actual numbers? For a log transform I know we can invert with exp(fitted(fit)) (or 10^fitted(fit) for log10), and for a square root there is a similar inverse, but what do we do for differencing, let alone double differencing?
Any help on this in R, please? After days of rigorous effort, I am stuck at this point.
I ran a test to check whether differencing has any impact on the model fit of the auto.arima function and found that it does. So auto.arima can't handle non-stationary series, and it requires some effort on the analyst's part to make the series stationary.
First, auto.arima without any differencing; orange is the actual series, blue the fitted values.
ARIMAfit <- auto.arima(val.ts, approximation = FALSE, trace = FALSE, xreg = xreg)
plot(val.ts, col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Second, I tried differencing once:
ARIMAfit <- auto.arima(diff(val.ts), approximation = FALSE, trace = FALSE, xreg = diff(xreg))
plot(diff(val.ts), col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Third, I differenced twice:
ARIMAfit <- auto.arima(diff(diff(val.ts)), approximation = FALSE, trace = FALSE,
                       xreg = diff(diff(xreg)))
plot(diff(diff(val.ts)), col = "orange")
lines(fitted(ARIMAfit), col = "blue")
Visual inspection suggests that the third fit is the most accurate of the three; that much I am aware of. The challenge is how to convert the fitted values, which are on the double-differenced scale, back into the actual metric!
The opposite of diff() is essentially cumsum(), but you need to know the starting value at each level of differencing.
e.g.:
set.seed(1234)
x <- runif(100)
# inner cumsum undoes the second difference, outer cumsum undoes the first
z <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
all.equal(z, x)
#> [1] TRUE
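For what it's worth, stats::diffinv() inverts diff() directly if you give it the initial values via xi; a minimal sketch on the same x:
z2 <- diffinv(diff(diff(x)), differences = 2, xi = x[1:2])
all.equal(z2, x)
#> [1] TRUE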
Share some of your data to make a reproducible example to better help answer the question.
If you expect that differencing will be necessary to obtain stationarity, then why not simply include the maximum differencing order in the function call? That is, the "I" in ARIMA is the order of differencing prior to fitting an ARMA model, such that if
y = diff(diff(x)) and y is an ARMA(p,q) process,
then
x follows an ARIMA(p,2,q) process.
In auto.arima() you specify the differencing with the d argument (or D if it involves seasonal differencing). Note that d fixes the order of differencing exactly (use max.d if you only want an upper bound), so since double differencing made your series stationary, you want something like:
fit <- auto.arima(val.ts, d = 2, ...)
From this, you can verify that the fitted values do indeed map onto the original data:
plot(val.ts)
lines(fitted(fit), col = "blue")
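A further benefit of letting the model handle the differencing internally is that forecasts also come back on the original scale, e.g. (a sketch; the horizon h is arbitrary, and any xreg used in fitting must also be supplied for the forecast period):
plot(forecast(fit, h = 30))   # forecasts are already in the original metric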
In the example below, containing dummy data, I have double differenced: first I removed seasonality (lag = 12), and then I removed trend from the differenced data (lag = 1).
library(magrittr)
set.seed(1234)
x <- rnorm(24, mean = 10, sd = 5) %>%
  round(0) %>%
  abs()
yy <- diff(x, lag = 12)
z <- diff(yy, lag = 1)
Using the script that @jeremycg included above (and which I repeat below), how would I remove the double difference? Would I need to add lag specifiers to the two nested diff() commands? If so, which diff() would get the lag = 12 specifier and which the lag = 1?
zz <- cumsum(c(x[1], cumsum(c(diff(x)[1], diff(diff(x))))))
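For reference, a sketch of how the two differences could be undone in reverse order with stats::diffinv(), supplying the matching lag and initial values (xi) at each step:
yy_rec <- diffinv(z, lag = 1, xi = yy[1])            # undo the lag-1 (trend) difference first
x_rec <- diffinv(yy_rec, lag = 12, xi = x[1:12])     # then undo the lag-12 (seasonal) difference
all.equal(x_rec, x)
#> [1] TRUE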

Time series forecasting, dealing with known big orders

I have many data sets with known outliers (big orders)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there to forecast the time series while taking these outliers into consideration?
I have already tried replacing each outlier with the next biggest value (running the data set 10 times, replacing one more outlier each time until the 10th data set has all of them replaced).
I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each time until all 10 are removed in the 10th data set).
I just want to point out that removing these big orders does not delete the data point completely, as there are other deals that happen in that quarter.
My code tests the data against multiple forecasting models (ARIMA weighted on the out-of-sample fit, ARIMA weighted on the in-sample fit, ARIMA weighted, ARIMA, additive Holt-Winters weighted and multiplicative Holt-Winters weighted), so it needs to be something that can be adapted to all of these models.
Here are a couple more data sets that I used; I do not have the outliers for these series yet, though:
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated, then I would settle for an explanation of how, in R, once outliers have been detected with the relevant commands, the data is typically treated before forecasting (e.g. smoothing), and how I can approach writing such code myself (without using the outlier-detection commands).
Your outliers appear to be seasonal variations, with the largest orders arriving in the 4th quarter. Many of the forecasting models you mention can include seasonal adjustments. As an example, the simplest such model has a linear dependence on year with a correction for each quarter. The code would look like:
df <- data.frame(period = c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
                            "10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
                            "13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
                 order = c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
                           135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
                           222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
                           231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
                           329429882.8, 264012891.6, 496745973.9, 284484362.55))
# split the period label into a numeric year and a quarter factor
seasonal <- data.frame(year = as.numeric(substr(df$period, 1, 2)),
                       qtr = substr(df$period, 3, 4), data = df$order)
# linear trend in year plus an additive correction for each quarter
ord_model <- lm(data ~ year + qtr, data = seasonal)
seasonal <- cbind(seasonal, fitted = ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal, id.vars = c("year", "qtr"),
                 variable.name = "Source", value.name = "Order")
ggplot(plot_fit, aes(x = year, y = Order, colour = qtr, shape = Source)) + geom_point(size = 3)
which gives the results shown in the chart below:
Models with a seasonal adjustment but a nonlinear dependence on year may give better fits.
You already said you tried different ARIMA models, but as mentioned by WaltS, your series doesn't seem to contain big outliers so much as a seasonal component, which is nicely captured by auto.arima() in the forecast package:
library(forecast)
myTs <- ts(as.numeric(data[,2]), start = c(2008, 1), frequency = 4)
myArima <- auto.arima(myTs, lambda = 0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda = 0 argument to auto.arima() applies a Box-Cox transformation equivalent to taking the log of the data, which accounts for the increasing amplitude of the seasonal component.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you...
You have an interesting time series here. The trend changes over time, with the upward trend weakening a bit. You can capture this change by bringing in two time-trend variables: the first beginning at period 1 and the second beginning at period 14. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonious, as the other three quarters do not differ from the average, and there is no need for an AR(12) term, seasonal differencing, or three seasonal dummies. You can also capture the impact of the last two observations being outliers with two more dummy variables. (Ignore the 49 above the word "trend"; that is just the name of the series being modeled.) A minimal sketch of such a regression follows below.
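This sketch is my illustration of the regression described above, not the answerer's exact model; the variable names are hypothetical, and it assumes the first data matrix from the question:
y <- as.numeric(data[, 2])                         # quarterly orders, 08Q1 to 15Q1
n <- length(y)                                     # 29 quarters
trend1 <- 1:n                                      # trend beginning at period 1
trend2 <- pmax(0, 1:n - 13)                        # second trend beginning at period 14
q4 <- as.numeric(substr(data[, 1], 3, 4) == "Q4")  # 4th-quarter dummy
out1 <- as.numeric(1:n == 28)                      # dummies for the last two observations
out2 <- as.numeric(1:n == 29)
fit <- lm(y ~ trend1 + trend2 + q4 + out1 + out2)
summary(fit)
plot(y, type = "l")
lines(fitted(fit), col = "blue")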
