I have many data sets with known outliers (big orders)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there that i can forecast the time series taking these outliers into consideration?
I have already tried replacing the next biggest outlier (so running the data set 10 times replacing the outliers with the next biggest until the 10th data set has all the outliers replaced).
I have also tried simply removing the outliers (so again running the data set 10 times removing an outlier each time until all 10 are removed in the 10th data set)
I just want to point out that removing these big orders does not delete the data point completely as there are other deals that happen in that quarter
My code tests the data through multiple forecasting models (ARIMA weighted on the out sample, ARIMA weighted on the in sample, ARIMA weighted, ARIMA, Additive Holt-winters weighted and Multiplcative Holt-winters weighted) so it needs to be something that can be adapted to these multiple models.
Here are a couple more data sets that i used, i do not have the outliers for these series yet though
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated then an explanation of how, in R, once outliers are detected using certain commands, the data is dealt with to forecast. e.g smoothing etc and how i can approach that writing a code myself (not using the commands that detect outliers)
Your outliers appear to be seasonal variations with the largest orders appearing in the 4-th quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. Code would look like:
df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
"10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
"13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
329429882.8, 264012891.6, 496745973.9, 42748656.73))
seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)
which gives the results shown in the chart below:
Models with a seasonal adjustment but nonlinear dependence upon year may give better fits.
You already said you tried different Arima-models, but as mentioned by WaltS, your series don't seem to contain big outliers, but a seasonal-component, which is nicely captured by auto.arima() in the forecast package:
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda=0 argument to auto.arima() forces a transformation (or you could take log) of the data by boxcox to take the increasing amplitude of the seasonal-component into account.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you....
You have an interesting time series here. The trend changes over time with the upward trend weakening a bit. If you bring in two time trend variables with the first beginning at 1 and another beginning at period 14 and forward you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonios as the other 3 quarters are not different from the average plus no need for an AR12, seasonal differencing or 3 seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. Ignore the 49 above the word trend as that is just the name of the series being modeled.
i'm new to R, so I'm having trouble with this time series data
For example (the real data is way larger)
data <- c(7,5,3,2,5,2,4,11,5,4,7,22,5,14,18,20,14,22,23,20,23,16,21,23,42,64,39,34,39,43,49,59,30,15,10,12,4,2,4,6,7)
ts <- ts(data,frequency = 12, start = c(2010,1))
So if I try to decompose the data to adjust it
ts.decompose <- decompose(ts)
ts.adjust <- ts - ts.decompose$seasonal
ts.hw <- HoltWinters(ts.adjust)
ts.forecast <- forecast.HoltWinters(ts.hw, h = 10)
plot.forecast(ts.forecast)
But for the first values I got negative values, why this is happening?
Well, you are forecasting the seasonally adjusted time series, and of course the deseasonalized series ts.adjust can already contain negative values by itself, and in fact, it actually does.
In addition, even if the original series contained only positive values, Holt-Winters can yield negative forecasts. It is not constrained.
I would suggest trying to model your original (not seasonally adjusted) time series directly using ets() in the forecast package. It usually does a good job in detecting seasonality. (And it can also yield negative forecasts or prediction intervals.)
I very much recommend this free online forecasting textbook. Given your specific question, this may also be helpful.
I have very short time series of data from a climatic experiment that I did back in 2012. The data consists of daily water solution flux and daily CO2 flux data. The CO2 flux data comprises 52 days and the water solution flux data is only 7 days long. I have several measurements a day for the CO2 flux data but I calculated daily averages.
Now, I would like to know if there is a trend in these time series. I have figured out that I can use the Kendall trend test or a Theil-Sen trend estimator. I used the Kendall test before for a time series spanning several years. I don't know how to use the Theil-Sen trend estimator.
I put my data into an ts object in R, but when I tried doing a decompositon (using the function decompose) I get the error that the time series is spanning less than 2 periods. I would like to extract the trend data and then do a Mann-Kendall test on it.
Here is the code that I got so far:
myexample <- structure(c(624.27, 682.06, 672.77,
765.96, 759.52, 760.38, 742.81
), .Names = c("Day1", "Day2", "Day3", "Day4", "Day5", "Day6",
"Day7"))
ts.object <- ts(myexample, frequency = 365, start = 1)
decomp.ts.obj <- decompose(ts.obj, type = "mult", filter=NULL)
# Error in decompose(ts.obj, type = "mult", filter = NULL)
Can anyone help me on how to do a trend analysis with my very short time series? Any google-fu was to no avail. And, can someone tell me what the size of the Kendall tau means? It spans values from -1 to 1. Is a tau=0.5 a strong or a weak trend?
Thanks,
Stefan
I would be tempted to do something simple like
d <- data.frame(val=myexample,ind=seq(myexample))
summary(lm(val~ind,d))
or
library(lmPerm)
summary(lmp(val~ind,d))
or
cor.test(~val+ind,data=d,method="kendall")
Whether tau=0.5 is strong or weak depends a lot on context. In this case the p-value is 0.24, which says at least that this signal (based on rank alone) is not distinguishable from an appropriate null signal.