I have discrete measurements of river flow spanning 22 years. As river flow is naturally continuous, I have attempted to fit a function to the data.
library(fda)
set.seed(1)
### 3 years of flow data
base = c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
year1 = sapply(base, function(x){x + runif(1)})
year2 = sapply(base, function(x){x + runif(1)})
year3 = sapply(base, function(x){x + runif(1)})
flow.mat = matrix(c(year1, year2, year3), ncol = 3)
Whilst Fourier basis systems are recommended for periodic data, the true data do not exhibit a strongly repeating pattern (ignore the data simulation for this assumption). They also contain important extreme values. Therefore, I attempted to fit a B-spline basis system to the data.
sp.basis <- create.bspline.basis(c(1, length(base)), norder = 6, nbasis = 15)
sb.fd <- smooth.basis(1:length(base), flow.mat, sp.basis)$fd
Ultimately, I intend to use the flow data as a covariate in a regression model with a monthly interval. This poses an issue, as I fitted annual functions to the data because they provided an improved fit over monthly ones, given the data's lack of temporal independence.
Therefore, I was wondering whether it is possible to subset the fitted functions, selecting one month at a time.
I suspect this is not possible; if so, is it possible to run an fPCA on subsetted data, as I intend to use the fPCA scores as the covariate in the model?
So far I have been completely unsuccessful in running a subsetted fPCA. Instead, I have been obtaining annual scores via the following:
pca.flow=pca.fd(sb.fd, 2)
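If the goal is monthly covariates, one option is to evaluate the annual fd object over each month's argument range with eval.fd(), rather than subsetting the fd object itself. A minimal sketch, rebuilding the simulated objects above; the 1-5 "month" boundaries are an illustrative assumption, not taken from the real data:

```r
library(fda)
set.seed(1)
base <- c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,
          6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
flow.mat <- sapply(1:3, function(i) base + runif(length(base)))
sp.basis <- create.bspline.basis(c(1, length(base)), norder = 6, nbasis = 15)
sb.fd <- smooth.basis(1:length(base), flow.mat, sp.basis)$fd

# Evaluate the fitted curves on a fine grid covering one "month" only
grid <- seq(1, 5, length.out = 50)    # hypothetical month boundaries
month.vals <- eval.fd(grid, sb.fd)    # 50 grid points x 3 curves
monthly.mean <- colMeans(month.vals)  # one summary value per curve
```

The evaluated values (or summaries such as the monthly mean) can then serve directly as covariates; pca.fd() operates on whole fd objects, so a genuinely monthly fPCA would require refitting a basis to each month's raw observations.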
Without getting into much sophistication, I just plotted your data and made a polynomial fit. I used a degree-4 polynomial because the curve is a wave with three ups and downs (4 is one more than the number of extrema of the fitted curve). As a matter of fact, degree 5 or more did not give a significant improvement.
What about doing the same for your 22-year time series?
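For reference, a minimal sketch of such a polynomial fit on the simulated base series from the question (the degree choices follow the reasoning above; this is an illustration, not the exact code used):

```r
base <- c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,
          6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
x <- seq_along(base)

fit4 <- lm(base ~ poly(x, 4))  # degree 4: one more than the extrema count
fit5 <- lm(base ~ poly(x, 5))  # degree 5 adds little explanatory power

summary(fit4)$r.squared
summary(fit5)$r.squared

plot(x, base)
lines(x, fitted(fit4), col = "red")
```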
I have, for instance, this data frame:
data <- data.frame(
x=c(1:12)
, case=c(3,5,1,8,2,4,5,0,8,2,3,5)
, rain=c(1,8,2,1,4,5,3,0,8,2,3,4)
, country=c("A","A","A","A","B","B","B","B","C","C","C","C")
, year=rep(seq(2000,2003,1),3)
)
I would like to perform 2 linear regressions and plot them on one graph.
In a nutshell, I would like to compare, on one and the same graph, the crude trend of cases over time (a simple lm) with the same trend of cases adjusted for rainfall over the years 2000 to 2003.
model<-lm(case~ year, data=data)
The second one would be a multiple linear regression. I used this code for the purpose, but I am not sure it is ideal.
modelrain<-lm(case~ I(year +rain), data=data)
I did it with a simple plot and abline, but I don't know how to do it with ggplot. I've created a new data frame, but it doesn't seem to work perfectly (so I won't include the rest of my code here).
Thank you very much
Building off the suggestions in the comments, there are three valid regression models:
model1<-lm(case~ year, data=data)
summary(model1)
model2<-lm(case~ year+rain, data=data)
summary(model2)
model3<-lm(case~ year*rain, data=data)
summary(model3)
With the limited data we have, there doesn't seem to be a lot going on.
The first question, how to plot the regression line for model1 using ggplot, is just:
ggplot(data, aes(x = year, y = case)) + geom_point() + geom_smooth(method = "lm")
As others have noted, it is unclear what user3355655 means by "adjusted" for rain (since rain and year can't truly exist on the same x-axis), but if we're willing to take the simplest course and simply treat rain as a factor, then:
ggplot(data,aes(x=year,y=case,color=factor(rain))) + geom_point() + geom_smooth(method="lm",fill=NA) + scale_y_continuous(limits = c(-1, 10))
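Another way to read "adjusted to rainfall" is to hold rain fixed at its mean and plot the predictions of the year + rain model next to the crude lm line. A sketch under that interpretation (the newdat helper and the colour labels are my own, not from the question):

```r
library(ggplot2)

data <- data.frame(
  x = 1:12,
  case = c(3,5,1,8,2,4,5,0,8,2,3,5),
  rain = c(1,8,2,1,4,5,3,0,8,2,3,4),
  country = rep(c("A", "B", "C"), each = 4),
  year = rep(seq(2000, 2003, 1), 3)
)

model2 <- lm(case ~ year + rain, data = data)

# "Rain-adjusted" trend: predictions across years with rain held at its mean
newdat <- data.frame(year = 2000:2003, rain = mean(data$rain))
newdat$adjusted <- predict(model2, newdata = newdat)

ggplot(data, aes(x = year, y = case)) +
  geom_point() +
  geom_smooth(aes(colour = "crude"), method = "lm", se = FALSE) +
  geom_line(data = newdat, aes(y = adjusted, colour = "rain-adjusted")) +
  labs(colour = "trend")
```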
I have a problem and hopefully somebody can help me.
I have a data set with compositional data: for each day of the week, over 160 weeks, the ratio of each car type is measured. The three ratios sum to 1. There are three types of cars in this research.
My task is to construct the mean and an 'error bar'. I used the following lines of code in R:
library(ggplot2)

Day = rep(c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday",
            "Saturday"), 3)
cars = c(rep("nissan", 7), rep("toyota", 7), rep("bmw", 7))
y <- colMeans(datadag, na.rm = TRUE)
delta <- apply(datadag, 2, sd, na.rm = TRUE)
df = data.frame(Day, cars, y, delta)
p <- ggplot(df, aes(x = Day, y = y, group = cars, color = cars)) +
  geom_point() +
  geom_errorbar(aes(ymin = y - delta, ymax = y + delta), width = .6)
print(p)
The code above gives the following plot:
The problem I face is that the error bars exceed 0 and 1, which is impossible for compositional data. Can anybody tell me what I did wrong?
Your problem is statistical, not R-related. You are assuming that the standard deviation will "know" that your data cannot be negative. Consider the following:
foo <- c(0,0,1,1000)
mean(foo) - sd(foo)
[1] -249.5836
I am not sure if the same problem can arise with the standard error but I suspect it can...
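One pragmatic fix, if you want to keep mean ± sd bars, is to clip the bar ends to the feasible [0, 1] interval; quantile-based bars avoid the issue entirely because they cannot leave the data's support. A sketch with made-up ratio data, since datadag is not available here:

```r
set.seed(1)
# Made-up compositional data: 160 rows, three car shares summing to 1
ratios <- matrix(runif(160 * 3), ncol = 3)
ratios <- ratios / rowSums(ratios)

y     <- colMeans(ratios)
delta <- apply(ratios, 2, sd)

# Option 1: clip mean +/- sd bars to the feasible range
lower <- pmax(y - delta, 0)
upper <- pmin(y + delta, 1)

# Option 2: empirical quantiles respect the bounds by construction
qs <- apply(ratios, 2, quantile, probs = c(0.16, 0.84))
```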
I have many data sets with known outliers (big orders).
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there that I can use to forecast the time series while taking these outliers into consideration?
I have already tried replacing the next-biggest outlier (running the data set 10 times, replacing outliers with the next biggest, until the 10th data set has all the outliers replaced).
I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each time, until all 10 are removed in the 10th data set).
I just want to point out that removing these big orders does not delete the data point completely, as there are other deals that happen in that quarter.
My code tests the data against multiple forecasting models (ARIMA weighted on the out-of-sample fit, ARIMA weighted on the in-sample fit, ARIMA weighted, ARIMA, additive Holt-Winters weighted, and multiplicative Holt-Winters weighted), so it needs to be something that can be adapted to these multiple models.
Here are a couple more data sets that I used; I do not have the outliers for these series yet, though.
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated, then an explanation would help of how, in R, the data are dealt with to produce a forecast once outliers are detected (e.g. smoothing), and how I can approach writing that code myself (without using the commands that detect outliers).
Your outliers appear to be seasonal variations, with the largest orders appearing in the 4th quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. The code would look like:
df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
"10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
"13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
329429882.8, 264012891.6, 496745973.9, 42748656.73))
seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)
which gives the results shown in the chart below:
Models with a seasonal adjustment but nonlinear dependence upon year may give better fits.
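A sketch of that nonlinear variant: a quadratic dependence on year with the same additive quarter corrections. Shown here on a small made-up quarterly series so the snippet runs on its own; with the real data you would reuse the seasonal data frame:

```r
set.seed(1)
# Made-up quarterly series with a quadratic trend and a strong Q4 effect
seasonal <- data.frame(
  year = rep(8:14, each = 4),
  qtr  = rep(c("Q1", "Q2", "Q3", "Q4"), 7)
)
seasonal$data <- 100 + 5 * seasonal$year + 2 * seasonal$year^2 +
  50 * (seasonal$qtr == "Q4") + rnorm(28, sd = 5)

# Quadratic dependence on year, additive quarter corrections
ord_model2 <- lm(data ~ poly(year, 2) + qtr, data = seasonal)
summary(ord_model2)
```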
You already said you tried different ARIMA models, but as mentioned by WaltS, your series doesn't seem to contain big outliers so much as a seasonal component, which is nicely captured by auto.arima() in the forecast package:
library(forecast)
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda=0 argument to auto.arima() forces a Box-Cox (log) transformation of the data, taking the increasing amplitude of the seasonal component into account.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you....
You have an interesting time series here. The trend changes over time, with the upward trend weakening a bit. If you bring in two time-trend variables, the first beginning at period 1 and the second beginning at period 14, you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonious, as the other three quarters are not different from the average, and there is no need for an AR12 term, seasonal differencing, or three seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. (Ignore the 49 above the word "trend" in the output; that is just the name of the series being modeled.)
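That specification can be sketched in plain lm() terms, using the order values from the question's first data set; the breakpoint at period 14, the Q4 dummy, and the two outlier dummies follow the description above, while the variable names are my own:

```r
y <- c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8,
       138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6,
       182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2,
       370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274,
       268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
       329429882.8, 264012891.6, 496745973.9, 284484362.55)
t <- seq_along(y)

trend1 <- t                        # trend beginning at period 1
trend2 <- pmax(t - 13, 0)          # second trend beginning at period 14
q4     <- as.numeric(t %% 4 == 0)  # 08Q1 is period 1, so Q4 = periods 4, 8, ...
out1   <- as.numeric(t == 28)      # dummies for the last two observations
out2   <- as.numeric(t == 29)

fit <- lm(y ~ trend1 + trend2 + q4 + out1 + out2)
summary(fit)
```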