I have discrete measurements of river flow spanning 22 years. As river flow is naturally continuous, I have attempted to fit a function to the data.
library(fda)
set.seed(1)
### 3 years of flow data
base = c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
year1 = sapply(base, function(x){x + runif(1)})
year2 = sapply(base, function(x){x + runif(1)})
year3 = sapply(base, function(x){x + runif(1)})
flow.mat = matrix(c(year1, year2, year3), ncol = 3)
Whilst Fourier basis systems are recommended for periodic data, the true data do not exhibit a strongly repeating pattern (ignore the data simulation for this assumption). They also contain important extreme values. Therefore, I attempted to fit a B-spline basis system to the data.
sp.basis <- create.bspline.basis(c(1, length(base)), norder = 6, nbasis = 15)
sb.fd <- smooth.basis(1:length(base), flow.mat, sp.basis)$fd
Ultimately, I intend to use the flow data as a covariate in a regression model with a monthly interval. This poses an issue, as I fitted annual functions to the data because they provided an improved fit over monthly ones, given the data's lack of temporal independence.
Therefore, I was wondering whether it is possible to subset the fitted functions, selecting one month at a time.
I suspect this is not possible; if so, is it possible to run an fPCA on subsetted data, as I intend to use the fPCA scores as the covariate in the model?
So far I have been completely unsuccessful in running a subsetted fPCA. Instead, I have been obtaining annual scores via the following:
pca.flow=pca.fd(sb.fd, 2)
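If the goal is monthly covariates, one option is to evaluate the annual fd object over each month's argument range with eval.fd(), rather than subsetting the fd object itself. A minimal sketch, rebuilding the simulated objects above; the 1-5 "month" boundaries are an illustrative assumption, not taken from the real data:

```r
library(fda)
set.seed(1)
base <- c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,
          6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
flow.mat <- sapply(1:3, function(i) base + runif(length(base)))
sp.basis <- create.bspline.basis(c(1, length(base)), norder = 6, nbasis = 15)
sb.fd <- smooth.basis(1:length(base), flow.mat, sp.basis)$fd

# Evaluate the fitted curves on a fine grid covering one "month" only
grid <- seq(1, 5, length.out = 50)    # hypothetical month boundaries
month.vals <- eval.fd(grid, sb.fd)    # 50 grid points x 3 curves
monthly.mean <- colMeans(month.vals)  # one summary value per curve
```

The evaluated values (or summaries such as the monthly mean) can then serve directly as covariates; pca.fd() operates on whole fd objects, so a genuinely monthly fPCA would require refitting a basis to each month's raw observations.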
Without getting into much sophistication, I just plotted your data and made a polynomial fit. I used a degree-4 polynomial because the curve is a wave with three ups and downs (4 is one more than the number of extrema of the fitted curve). As a matter of fact, degree 5 or more did not give a significant improvement.
What about doing the same for your 22-year time series?
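For reference, a minimal sketch of such a polynomial fit on the simulated base series from the question (the degree choices follow the reasoning above; this is an illustration, not the exact code used):

```r
base <- c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,
          6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
x <- seq_along(base)

fit4 <- lm(base ~ poly(x, 4))  # degree 4: one more than the extrema count
fit5 <- lm(base ~ poly(x, 5))  # degree 5 adds little explanatory power

summary(fit4)$r.squared
summary(fit5)$r.squared

plot(x, base)
lines(x, fitted(fit4), col = "red")
```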
I have, for instance, this data frame:
data <- data.frame(
x=c(1:12)
, case=c(3,5,1,8,2,4,5,0,8,2,3,5)
, rain=c(1,8,2,1,4,5,3,0,8,2,3,4)
, country=c("A","A","A","A","B","B","B","B","C","C","C","C")
, year=rep(seq(2000,2003,1),3)
)
I would like to perform 2 linear regressions and plot them on one graph.
In a nutshell, I would like to compare, on one and the same graph, the crude trend of cases over time (a simple lm) with the same trend of cases adjusted for rainfall over the years 2000 to 2003.
model<-lm(case~ year, data=data)
The second one would be a multiple linear regression. I used this code for the purpose, but I am not sure it is ideal.
modelrain<-lm(case~ I(year +rain), data=data)
I did it with a simple plot and abline, but I don't know how to do it with ggplot. I've created a new data frame, but it doesn't seem to work perfectly (so I won't include the rest of my code here).
Thank you very much
Building off the suggestions in the comments, there are three valid regression models:
model1<-lm(case~ year, data=data)
summary(model1)
model2<-lm(case~ year+rain, data=data)
summary(model2)
model3<-lm(case~ year*rain, data=data)
summary(model3)
With the limited data we have, there doesn't seem to be a lot going on.
The first question, how to plot the regression line for model1 using ggplot, is just:
ggplot(data, aes(x = year, y = case)) + geom_point() + geom_smooth(method = "lm")
As others have noted, it is unclear what user3355655 means by "adjusted" for rain (since rain and year can't truly exist on the same x-axis), but if we're willing to take the simplest course and simply treat rain as a factor, then:
ggplot(data,aes(x=year,y=case,color=factor(rain))) + geom_point() + geom_smooth(method="lm",fill=NA) + scale_y_continuous(limits = c(-1, 10))
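Another way to read "adjusted to rainfall" is to hold rain fixed at its mean and plot the predictions of the year + rain model next to the crude lm line. A sketch under that interpretation (the newdat helper and the colour labels are my own, not from the question):

```r
library(ggplot2)

data <- data.frame(
  x = 1:12,
  case = c(3,5,1,8,2,4,5,0,8,2,3,5),
  rain = c(1,8,2,1,4,5,3,0,8,2,3,4),
  country = rep(c("A", "B", "C"), each = 4),
  year = rep(seq(2000, 2003, 1), 3)
)

model2 <- lm(case ~ year + rain, data = data)

# "Rain-adjusted" trend: predictions across years with rain held at its mean
newdat <- data.frame(year = 2000:2003, rain = mean(data$rain))
newdat$adjusted <- predict(model2, newdata = newdat)

ggplot(data, aes(x = year, y = case)) +
  geom_point() +
  geom_smooth(aes(colour = "crude"), method = "lm", se = FALSE) +
  geom_line(data = newdat, aes(y = adjusted, colour = "rain-adjusted")) +
  labs(colour = "trend")
```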
I have a problem and hopefully somebody can help me.
I have a data set with compositional data: for each day of the week, over 160 weeks, the ratio of each car type is measured. The three ratios sum to 1. There are three types of cars in this research.
My task is to construct the mean and an 'error bar'. I used the following lines of code in R:
library(ggplot2)

Day = rep(c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday",
            "Saturday"), 3)
cars = c(rep("nissan", 7), rep("toyota", 7), rep("bmw", 7))
y <- colMeans(datadag, na.rm = TRUE)
delta <- apply(datadag, 2, sd, na.rm = TRUE)
df = data.frame(Day, cars, y, delta)
p <- ggplot(df, aes(x = Day, y = y, group = cars, color = cars)) +
  geom_point() +
  geom_errorbar(aes(ymin = y - delta, ymax = y + delta), width = .6)
print(p)
The code above gives the following plot:
The problem I face is that the error bars exceed 0 and 1, which is impossible for compositional data. Can anybody tell me what I did wrong?
Your problem is statistical, not R-related. You are assuming that the standard deviation will "know" that your data cannot be negative. Consider the following:
foo <- c(0,0,1,1000)
mean(foo) - sd(foo)
[1] -249.5836
I am not sure if the same problem can arise with the standard error but I suspect it can...
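One pragmatic fix, if you want to keep mean ± sd bars, is to clip the bar ends to the feasible [0, 1] interval; quantile-based bars avoid the issue entirely because they cannot leave the data's support. A sketch with made-up ratio data, since datadag is not available here:

```r
set.seed(1)
# Made-up compositional data: 160 rows, three car shares summing to 1
ratios <- matrix(runif(160 * 3), ncol = 3)
ratios <- ratios / rowSums(ratios)

y     <- colMeans(ratios)
delta <- apply(ratios, 2, sd)

# Option 1: clip mean +/- sd bars to the feasible range
lower <- pmax(y - delta, 0)
upper <- pmin(y + delta, 1)

# Option 2: empirical quantiles respect the bounds by construction
qs <- apply(ratios, 2, quantile, probs = c(0.16, 0.84))
```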
I have many data sets with known outliers (big orders).
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there that I can use to forecast the time series while taking these outliers into consideration?
I have already tried replacing the next-biggest outlier (running the data set 10 times, replacing outliers with the next biggest, until the 10th data set has all the outliers replaced).
I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each time, until all 10 are removed in the 10th data set).
I just want to point out that removing these big orders does not delete the data point completely, as there are other deals that happen in that quarter.
My code tests the data against multiple forecasting models (ARIMA weighted on the out-of-sample fit, ARIMA weighted on the in-sample fit, ARIMA weighted, ARIMA, additive Holt-Winters weighted, and multiplicative Holt-Winters weighted), so it needs to be something that can be adapted to these multiple models.
Here are a couple more data sets that I used; I do not have the outliers for these series yet, though.
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated, then an explanation would help of how, in R, the data are dealt with to produce a forecast once outliers are detected (e.g. smoothing), and how I can approach writing that code myself (without using the commands that detect outliers).
Your outliers appear to be seasonal variations, with the largest orders appearing in the 4th quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. The code would look like:
df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
"10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
"13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
329429882.8, 264012891.6, 496745973.9, 42748656.73))
seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)
which gives the results shown in the chart below:
Models with a seasonal adjustment but nonlinear dependence upon year may give better fits.
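A sketch of that nonlinear variant: a quadratic dependence on year with the same additive quarter corrections. Shown here on a small made-up quarterly series so the snippet runs on its own; with the real data you would reuse the seasonal data frame:

```r
set.seed(1)
# Made-up quarterly series with a quadratic trend and a strong Q4 effect
seasonal <- data.frame(
  year = rep(8:14, each = 4),
  qtr  = rep(c("Q1", "Q2", "Q3", "Q4"), 7)
)
seasonal$data <- 100 + 5 * seasonal$year + 2 * seasonal$year^2 +
  50 * (seasonal$qtr == "Q4") + rnorm(28, sd = 5)

# Quadratic dependence on year, additive quarter corrections
ord_model2 <- lm(data ~ poly(year, 2) + qtr, data = seasonal)
summary(ord_model2)
```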
You already said you tried different ARIMA models, but as mentioned by WaltS, your series doesn't seem to contain big outliers so much as a seasonal component, which is nicely captured by auto.arima() in the forecast package:
library(forecast)
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda=0 argument to auto.arima() forces a Box-Cox (log) transformation of the data, taking the increasing amplitude of the seasonal component into account.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you....
You have an interesting time series here. The trend changes over time, with the upward trend weakening a bit. If you bring in two time-trend variables, the first beginning at period 1 and the second beginning at period 14, you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonious, as the other three quarters are not different from the average, and there is no need for an AR12 term, seasonal differencing, or three seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. (Ignore the 49 above the word "trend" in the output; that is just the name of the series being modeled.)
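That specification can be sketched in plain lm() terms, using the order values from the question's first data set; the breakpoint at period 14, the Q4 dummy, and the two outlier dummies follow the description above, while the variable names are my own:

```r
y <- c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8,
       138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6,
       182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2,
       370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274,
       268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
       329429882.8, 264012891.6, 496745973.9, 284484362.55)
t <- seq_along(y)

trend1 <- t                        # trend beginning at period 1
trend2 <- pmax(t - 13, 0)          # second trend beginning at period 14
q4     <- as.numeric(t %% 4 == 0)  # 08Q1 is period 1, so Q4 = periods 4, 8, ...
out1   <- as.numeric(t == 28)      # dummies for the last two observations
out2   <- as.numeric(t == 29)

fit <- lm(y ~ trend1 + trend2 + q4 + out1 + out2)
summary(fit)
```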