Fitting and forecasting a daily time series in R

I am working with a daily time series and I need to build a forecast for 90 days (or maybe more) based on my history. The current time series has roughly 298 data points.
The issue I have is the famous flat line in the final forecast. Yes, I might not have a seasonality, but I am trying to work this out. Another issue is how to find the best model for this kind of behaviour and adapt it from here on.
I created a test case to investigate this further, and any help is appreciated.
Thanks,
To start with:
x <- day_data # My time series
z <- 90 # Days to forecast
low_bound_date <- as.POSIXlt(min(x$time), format = "%m/%d/%Y") # oldest date in the DF.
> low_bound_date
[1] "2015-12-21 PST"
> low_bound_date$yday # Day in Julian
[1] 354
lbyear <- as.numeric(substr(low_bound_date, 1, 4))
> lbyear
[1] 2015
This is my time series content:
> ts
Time Series:
Start = c(2065, 4)
End = c(2107, 7)
Frequency = 7
[1] 20.73 26.19 27.51 26.11 26.28 27.58 26.84 27.00 26.30 28.75 28.43 39.03 41.36 45.42 44.80 45.33 47.79 44.70 45.17
[20] 34.90 32.54 32.75 33.35 34.76 34.11 33.59 33.60 38.08 30.45 29.66 31.09 31.36 31.96 29.30 30.04 30.85 31.13 25.09
[39] 17.88 23.73 25.31 31.30 35.18 34.13 34.96 35.12 27.36 38.33 38.59 38.14 38.54 41.72 37.15 35.92 37.37 32.39 30.64
[58] 30.57 30.66 31.16 31.50 30.68 32.21 32.27 32.55 33.61 34.80 33.53 33.09 20.90 6.91 7.82 15.78 7.25 6.19 6.38
[77] 38.06 39.82 35.53 38.63 41.91 39.76 37.26 38.79 37.74 35.61 39.70 35.79 35.36 29.63 22.07 35.39 35.99 37.35 38.82
[96] 25.80 21.31 18.85 9.52 20.75 36.83 44.12 37.79 34.45 36.05 16.39 21.84 31.39 34.26 31.50 30.87 28.88 42.83 41.52
[115] 42.34 47.35 44.47 44.10 44.49 26.89 18.17 40.44 43.93 41.56 39.98 40.31 40.59 40.17 40.22 40.50 32.68 35.89 36.06
[134] 34.30 22.67 12.56 13.29 12.34 28.00 35.27 36.57 33.78 32.15 33.58 34.62 30.96 32.06 33.05 30.66 32.47 30.42 32.83
[153] 31.74 29.39 22.39 12.58 16.46 5.36 4.01 15.32 32.79 31.66 32.02 27.60 31.47 31.61 34.96 27.77 31.91 33.94 33.43
[172] 26.94 28.38 21.42 24.51 23.82 31.71 26.64 27.96 29.29 29.25 28.70 27.02 27.62 30.90 27.46 27.37 26.46 27.77 13.61
[191] 5.87 12.18 5.68 4.15 4.35 4.42 16.42 25.18 26.06 27.39 27.57 28.86 15.18 5.19 5.61 8.28 7.78 5.13 4.90
[210] 5.02 5.27 16.31 25.01 26.19 25.96 24.93 25.53 25.56 26.39 26.80 26.73 26.00 25.61 25.90 25.89 13.80 6.66 6.41
[229] 5.28 5.64 5.71 5.38 5.76 7.20 7.27 5.55 5.31 5.94 5.75 5.93 5.77 6.57 5.52 5.51 5.47 5.69 19.75
[248] 29.22 30.75 29.63 30.49 29.48 31.83 30.42 29.27 30.40 29.91 32.00 30.09 28.93 14.54 7.75 5.63 17.17 22.27 24.93
[267] 35.94 37.42 33.13 25.88 24.27 37.64 37.42 38.33 35.20 21.32 7.32 4.81 5.17 17.49 23.77 23.36 27.60 26.53 24.99
[286] 24.22 23.76 24.10 24.22 27.06 25.53 23.40 37.07 26.52 25.19 28.02 28.53 26.67
First step: I get my data into a ts object.
day_data_ts <- ts(x$avg_day, start = c(lbyear,low_bound_date$yday), frequency=7)
plot(day_data_ts)
acf(day_data_ts)
Second step: I get my data into an msts object.
day_data_msts <- msts(x$avg_day, seasonal.periods=c(7,365.25), start = c(lbyear,low_bound_date$yday))
plot(day_data_msts)
acf(day_data_msts)
I did several fitting iterations to try to figure out the best fitting and forecasting model.
The first fitting test uses the ts object only.
fit1 <- HoltWinters(day_data_ts)
> fit1
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call: HoltWinters(x = day_data_ts)
Smoothing parameters: alpha: 1 beta : 0.006757112 gamma: 0
Coefficients:
[,1]
a 28.0922449
b 0.1652477
s1 0.6241837
s2 1.9084694
s3 0.9913265
s4 0.8198980
s5 -1.7015306
s6 -1.2201020
s7 -1.4222449
fit2 <- tbats(day_data_ts)
> fit2
BATS(1, {0,0}, 0.8, -)
Parameters: Alpha: 1.309966 Beta: -0.3011143 Damping Parameter: 0.800001
Seed States:
[,1]
[1,] 15.282259
[2,] 2.177787
Sigma: 5.501356 AIC: 2723.911
fit3 <- ets(day_data_ts)
> fit3
ETS(A,N,N)
Smoothing parameters: alpha = 0.9999
Initial states: l = 25.2275
sigma: 5.8506
AIC AICc BIC
2756.597 2756.678 2767.688
fit4 <- auto.arima(day_data_ts)
> fit4
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.7396 -0.6897 -0.2769
s.e. 0.0545 0.0690 0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81 AICc=1863.94 BIC=1878.58
Second test is using msts. I also changed the ets model to MAN.
fit5 <- tbats(day_data_msts)
> fit5
BATS(1, {0,0}, 0.8, -)
Parameters: Alpha: 1.309966 Beta: -0.3011143 Damping Parameter: 0.800001
Seed States:
[,1]
[1,] 15.282259
[2,] 2.177787
Sigma: 5.501356 AIC: 2723.911
fit6 <- ets(day_data_msts, model="MAN")
> fit6
ETS(M,A,N)
Smoothing parameters: alpha = 0.9999 beta = 9e-04
Initial states: l = 52.8658 b = 3.9184
sigma: 0.3459
AIC AICc BIC
3042.744 3042.949 3061.229
fit7 <- auto.arima(day_data_msts)
> fit7
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.7396 -0.6897 -0.2769
s.e. 0.0545 0.0690 0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81 AICc=1863.94 BIC=1878.58
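To close the loop on the original goal, here is a minimal sketch (not part of the tests above, and assuming the fits above and z <- 90 are still in the workspace) of how the 90-day forecasts could be generated, compared and plotted with the forecast package:
library(forecast)
fc_tbats <- forecast(fit5, h = z)   # TBATS forecast over the 90-day horizon
fc_arima <- forecast(fit7, h = z)   # ARIMA forecast for comparison
plot(fc_tbats)                      # this is where the flat-line behaviour shows up
lines(fc_arima$mean, col = "red")   # overlay the ARIMA point forecast
accuracy(fit5)                      # in-sample accuracy of the TBATS fit
accuracy(fit7)                      # in-sample accuracy of the ARIMA fit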

You can forecast from a previously estimated model as follows (using the built-in time series LakeHuron):
library(forecast)
y <- LakeHuron
tsdisplay(y)
# estimate ARMA(1,1)
mod_2 <- Arima(y, order = c(1, 0, 1))
#make forecast for 5 periods (years in this case)
fHuron <- forecast(mod_2, h = 5)
#show results in table
fHuron
#plot results
plot(fHuron)
This will give you the forecast table and the forecast plot.
Note that an ARIMA model bases its forecast on previous values, so when we predict many periods ahead the model uses already-predicted values to predict the next ones, which reduces accuracy.
To fit an optimal ARIMA model (lowest AIC over a small grid of orders), use a function like this:
library(R.utils) # for the function 'withTimeout'
fitARIMA <- function(timeseriesObject, timeout)
{
  final.aic <- Inf
  final.order <- c(0, 0, 0)
  for (p in 0:5) for (q in 0:5) {
    if (p == 0 && q == 0) {
      next
    }
    # skip orders that error, warn, or exceed the timeout
    arimaFit <- tryCatch(
      withTimeout(arima(timeseriesObject, order = c(p, 0, q)),
                  timeout = timeout),
      error = function(err) FALSE,
      warning = function(err) FALSE)
    if (!is.logical(arimaFit)) {
      current.aic <- AIC(arimaFit)
      if (current.aic < final.aic) {
        final.aic <- current.aic
        final.order <- c(p, 0, q)
        final.arima <- arima(timeseriesObject, order = final.order)
      }
    } else {
      next
    }
  }
  final.order <- c(final.order, final.aic)
  final.order
}
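As a usage sketch for the function above (the 10-second timeout is just an illustrative value), it could be applied to the daily series like this:
best <- fitARIMA(day_data_ts, 10)
best   # c(p, 0, q) of the best order found, followed by its AIC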

Related

CV and adjCV values are the same in PCR

I am running PCR on a data set, but PCR is giving me the same values for both CV and adjCV. Is this correct, or is there something wrong with the data?
Here is my code:
pcr <- pcr(F1~., data = data, scale = TRUE, validation = "CV")
summary(pcr)
validationplot(pcr)
validationplot(pcr, val.type = "MSEP")
validationplot(pcr, val.type = "R2")
predplot(pcr)
coefplot(pcr)
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE,
prob = c(0.8,0.2))
train <- data[ind ==1,]
test <- data[ind ==2,]
pcr_train <- pcr(F1~., data = train, scale =TRUE, validation = "CV")
y_test <- test[, 1]
pcr_pred <- predict(pcr, test, ncomp = 4)
mean((pcr_pred - y_test) ^2)
And I get this warning when I run the mean command:
Warning in mean.default((pcr_pred - y_test)^2) :
argument is not numeric or logical: returning NA
Sample data:
F1 F2 F3 F4 F5
4.378 2.028 -5.822 -3.534 -0.546
4.436 2.064 -5.872 -3.538 -0.623
4.323 1.668 -5.954 -3.304 -0.782
5.215 3.319 -5.863 -4.139 -0.632
4.074 1.497 -6.018 -3.176 -0.697
4.403 1.761 -6 -3.339 -0.847
4.99 3.105 -5.985 -3.97 -0.638
4.783 2.968 -5.94 -3.903 -0.481
4.361 1.786 -5.866 -3.397 -0.685
4.594 1.958 -5.985 -3.457 -0.91
0.858 -4.734 -6.104 -0.692 -0.87
0.878 -3.846 -6.289 -1.064 -0.618
0.876 -4.479 -6.148 -0.803 -0.801
0.937 -5.498 -5.958 -0.376 -1.184
0.953 -4.71 -6.123 -0.705 -0.96
0.738 -5.386 -5.877 -0.444 -0.884
0.833 -5.562 -5.937 -0.343 -1.104
1.184 -3.52 -6.221 -1.234 -0.38
1.3 -4.129 -6.168 -0.963 -0.73
3.359 -3.618 -5.302 0.481 -0.649
3.483 -2.938 -5.361 0.157 -0.482
3.673 -3.779 -5.326 0.516 -1.053
2.521 -6.577 -4.499 1.861 -1.374
2.52 -4.757 -4.866 1.182 -0.736
2.482 -4.732 -4.857 1.142 -0.708
2.543 -6.699 -4.496 1.947 -1.426
2.458 -3.182 -5.219 0.514 -0.255
2.558 -5.66 -4.757 1.558 -1.142
2.627 -1.806 -5.313 -1.808 1.054
3.773 -0.526 -5.236 -0.6 -0.23
3.65 -0.954 -4.97 -0.361 -0.413
3.816 -1.18 -5.228 -0.284 -0.575
3.752 -0.522 -5.346 -0.562 -0.293
3.961 -0.24 -5.423 -0.69 -0.408
3.734 -0.711 -5.307 -0.479 -0.347
4.094 -0.415 -5.103 -0.729 -0.35
3.894 -0.957 -5.133 -0.435 -0.457
3.741 -0.484 -5.363 -0.574 -0.279
3.6 -0.698 -5.422 -0.435 -0.306
3.845 -0.351 -5.306 -0.666 -0.269
3.886 -0.481 -5.332 -0.596 -0.39
3.552 -2.106 -5.043 0.128 -0.634
4.336 -10.323 -2.95 3.346 -3.494
3.918 -0.809 -5.315 -0.442 -0.567
3.757 -0.502 -5.347 -0.572 -0.288
3.712 -0.627 -5.353 -0.505 -0.314
3.954 -0.72 -5.492 -0.428 -0.691
4.088 -0.588 -5.412 -0.53 -0.688
3.728 -0.641 -5.338 -0.505 -0.321
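For what it's worth, here is a hedged sketch of how the test-set MSE could be computed if the warning is caused by y_test being a one-column data frame (e.g. when data is a tibble) rather than a numeric vector; the column name F1 and ncomp = 4 come from the code above, the rest is an assumption:
y_test <- test[["F1"]]                           # plain numeric vector instead of a one-column data frame
pcr_pred <- predict(pcr_train, test, ncomp = 4)  # predict with the model fitted on the training data
mean((as.numeric(pcr_pred) - y_test)^2)          # test-set MSE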

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
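In case it helps, here is a hedged sketch of one way the wide header/value pair could be parsed; it assumes each record starts with a line beginning with FILE and that the matching values sit on the very next line, and the file name datasheet.txt is hypothetical:
library(stringr)
lines <- readLines("datasheet.txt")
hdr_idx <- grep("^FILE", lines)                          # one header line per record
extract_one <- function(i) {
  header <- str_split(str_trim(lines[i]), "\\s+")[[1]]
  values <- str_split(str_trim(lines[i + 1]), "\\s+")[[1]]
  data.frame(QUAD = as.numeric(values[header == "QUAD"]),
             LAI  = as.numeric(values[header == "LAI"]))
}
data_extract <- do.call(rbind, lapply(hdr_idx, extract_one))
data_extract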

Warning message in applying fGarch package for fitting a simple GARCH model

I tried to fit a simple GARCH model to the following data set (which contains weekly prices of an agricultural commodity) using the fGarch package. But after every variant of the model, R gives the following warning message, implying the function is not running correctly.
"Warning message:
Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead."
I do not know how to correct this and proceed with modelling. Any advice on running the model correctly would be appreciated. Thank you very much in advance.
Code used:
library(fGarch)
garch<-read.csv("crrp.csv",header=T,sep="," )
attach(garch)
head(garch)
tcrrp = ts(garch$crrp, start=c(1997,1),end=c(1998,52), frequency=52)
lcr<-(log(tcrrp))
dlcr<-diff(lcr)
dat<-cbind(dlcr)
car1<-garchFit(dlcr~garch(1, 0), data = dat, trace=FALSE, cond.dist='std')
summary(car1)
"Warning message:
Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead"
Data
crrp
35.57
33.89
33.65
32.48
32.5
32.59
34.01
34.35
35.32
35
35
36.5
34.29
33.09
43.59
42.44
43.1
40.38
45.28
47.49
53.57
59.96
60.15
60.16
61.53
57.24
52.24
49.68
47.73
40.95
36
33.67
32.82
32
32
32
31.9
31.67
31.14
31.73
31.87
32.44
33.49
37.5
40.51
45.76
51.16
59.33
67.27
75.72
76.05
84.19
89.33
87.1
88.25
84.86
91.14
90.72
72.84
59.18
59.9
62.2
54.05
47.02
43.86
42.18
44.1
45.67
42.49
43.36
46.93
44.56
66.11
66.76
64.62
65.9
69.86
68.58
63.72
56.46
54.2
56.62
51.3
50.3
42.88
40.14
43.37
38.27
36.29
34.26
33.2
34.1
34.11
34.9
35.93
34.93
33.8
34.1
34.95
35.02
34.64
34.16
38.49
48.13
I had the same question, and I solved it by running the following code together instead of running the lines separately (using your code as an example):
car1 <- garchFit(dlcr~garch(1, 0), data = dat, trace=FALSE, cond.dist='std')
summary(car1)

Computing regression intercepts and saving them in a separate column

Our dataframe has the following structure:
RecordID datecode Name Betha RF MKTRF Ri
60 1 2014-12-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -0.06 1.61
61 1 2015-01-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -3.11 -3.53
62 1 2015-02-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 6.13 5.49
63 1 2015-03-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -1.12 0.29
64 1 2015-04-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 0.59 0.67
65 1 2015-05-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 1.36 0.57
392035 3267 2019-07-01 Wasatch Core Growth Institutional 0.6421722 0.0019 1.19 6.75
392036 3267 2019-08-01 Wasatch Core Growth Institutional 0.6421722 0.0016 -2.58 0.09
392037 3267 2019-09-01 Wasatch Core Growth Institutional 0.6421722 0.0018 1.44 4.99
392038 3267 2019-10-01 Wasatch Core Growth Institutional 0.6421722 0.0015 2.06 -3.68
392039 3267 2019-11-01 Wasatch Core Growth Institutional 0.6421722 0.0012 3.87 5.35
392040 3267 2019-12-01 Wasatch Core Growth Institutional 0.6421722 0.0014 2.77 1.12
We need to compute yearly Jensen's Alpha and Fama & French 3-factor alphas and store them in separate columns in order to run regressions on them. The formulas for both regressions are illustrated below:
Jensen Alpha: Ri ~ a + B1*MKTRF + e
3-factor Alpha: Ri ~ a + B1*MKTRF + B2*SMB + B3*HML + B4*UMD + e
We have tried saving the data as a data table and in a panel data format, and running this regression from a similar post to compute the Jensen's alpha:
dt[, alpha:= roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], width = 12L)$coefs[, 1], by = RecordID]
The post: Rollregres with multiple regression and panel data
However, it did not work and kept giving the error message:
Error in roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], :
subscript out of bounds
We are using the "rollRegres" package, and no additional packages have been used.
What are we doing wrong, and can anybody help us compute the yearly Jensen's alpha so that we can store it in a separate column? :)
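As a hedged alternative sketch (not using rollRegres): if a plain calendar-year alpha per fund is enough, the intercept of an lm() fitted per fund and year can be stored directly with data.table; the year column derived from datecode is an assumption, and each fund-year needs enough monthly observations for the regression to make sense:
library(data.table)
dt[, year := format(as.Date(datecode), "%Y")]   # assumes datecode parses as a date
dt[, jensen_alpha := coef(lm(Ri ~ MKTRF))[1],   # intercept = yearly Jensen's alpha
   by = .(RecordID, year)]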

Why does MARS (earth package) generate so many predictors?

I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). Below is the head of my dataset:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why is earth identifying 13169 predictors when there are actually only 10? It seems that MARS is treating single observations of the candidate predictors as predictors themselves. How can I stop MARS from doing so?
Thanks for your help
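A hedged diagnostic sketch: in earth, a predictor count that large usually means some columns were read in as factors or characters, so every distinct value becomes its own dummy predictor (which would also explain term names like LST_f128.68 in the output). Checking the column types, and coercing them back to numeric if that assumption holds, would rule this out before refitting:
sapply(CE, class)                                                # any "factor" or "character" columns explain the inflated count
CE[] <- lapply(CE, function(col) as.numeric(as.character(col)))  # assumes every column should be numeric
mymodel.mod <- earth(D9_RTO_avg ~ ., data = CE, nk = 10)         # refit on purely numeric data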
