I am trying to modify some working code so that it estimates the model with a different function. The original code is the following, and it works with the arima() function:
S <- round(0.75 * length(ts_HHFCE_log))   # start of the evaluation window
h <- 1                                    # forecast horizon
error1.h <- c()
for (i in S:(length(ts_HHFCE_log) - h)) {
  # re-estimate on data up to i, forecast h steps ahead, store the forecast error
  mymodel.sub <- arima(ts_HHFCE_log[1:i], order = c(0, 1, 3), seasonal = c(0, 0, 0))
  predict.h <- predict(mymodel.sub, n.ahead = h)$pred[h]
  error1.h <- c(error1.h, ts_HHFCE_log[i + h] - predict.h)
}
The intuition is the following. Your time series has length T. You start somewhere near the beginning of your sample, far enough in to have sufficient observations to regress and obtain coefficients for your alphas and betas; call this point t for simplicity. Based on the data up to t, you produce a one-step-ahead forecast, i.e. for period (t+1). Your forecast error is the difference between the actual value at (t+1) and the forecast built from data available up to t. Then you iterate: use data from the start to (t+1), regress, forecast (t+2), and obtain a forecast error for (t+2). You keep repeating this process until you reach (T-1) and produce a forecast for T. This yields what is known as a dynamic out-of-sample forecast error series. You do this for different models and then use a statistical test to ascertain which model is more appropriate. It is a way to do out-of-sample forecasting using only the data you already have.
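(For what it's worth, the statistical test I have in mind for that final comparison is, e.g., a Diebold-Mariano test on two such error series; a minimal sketch using the forecast package, assuming error1.h and a second series error2.h from the competing model, of equal length:)
library(forecast)
# H0: the two models are equally accurate at horizon h = 1
dm.test(error1.h, error2.h, h = 1, power = 2)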
I have modified the code to be the following:
S <- round(0.75 * length(ts.GDP))
h <- 1
error1.h <- c()
for (i in S:(length(ts.GDP) - h)) {
  mymodel.sub <- lm(ts.GDP[4:i] ~ ts.GDP[3:(i - 1)] + ts.GDP[2:(i - 2)] + ts.GDP[1:(i - 3)])
  predict.h <- predict(mymodel.sub, n.ahead = h)$pred[h]
  error1.h <- c(error1.h, ts.GDP[i + h] - predict.h)
}
I'm trying to fit an AR(3) model. The reason I am not using the arima() function is that I then also want to compare these forecast errors with those from an ARDL model, and to my knowledge there is no simple function for ARDL (I'd have to use lm(), which is why I want to fit the AR(3) model with lm() as well).
The model I wish to compare the AR(3) model against is the following:
model_ts.GDP_1 <- lm(ts.GDP[4:123] ~ ts.GDP[3:122] + ts.GDP[2:121] + ts.GDP[1:120] + ts.CCI_AGG[3:122] + ts.CCI_AGG[2:121] + ts.CCI_AGG[1:120])
I am unsure how to modify the code further to get what I am after. Hopefully the intuition I explained above makes clear what I am trying to do.
The GDP data is basically the quarterly growth rate, and it is stationary. The other variable in the second model is an index I've constructed using a dynamic PCA, taken in first differences, so it too is stationary. In any case, in the second model the forecast at t is based only on lagged data of both GDP and the index I constructed. And since I am simulating out-of-sample forecasts using data I already have, there is no issue with producing the forecasts themselves. (In time series, this technique is seen as a more robust way to compare models than simply using measures such as RMSE.)
Thanks!
The data I am using:
Date GDP_qoq CCI_A_qoq
31/03/1988 2.956 0.540
30/06/1988 2.126 -0.743
30/09/1988 3.442 0.977
31/12/1988 3.375 -0.677
31/03/1989 2.101 0.535
30/06/1989 1.787 -0.667
30/09/1989 2.791 0.343
31/12/1989 2.233 -0.334
31/03/1990 1.961 0.520
30/06/1990 2.758 -0.763
30/09/1990 1.879 0.438
31/12/1990 0.287 -0.708
31/03/1991 1.796 -0.078
30/06/1991 1.193 -0.735
30/09/1991 0.908 0.896
31/12/1991 1.446 0.163
31/03/1992 0.870 0.361
30/06/1992 0.215 -0.587
30/09/1992 0.262 0.238
31/12/1992 1.646 -1.436
31/03/1993 2.375 0.646
30/06/1993 0.249 -0.218
30/09/1993 1.806 0.676
31/12/1993 1.218 -0.393
31/03/1994 1.501 0.346
30/06/1994 0.879 -0.501
30/09/1994 1.123 0.731
31/12/1994 2.089 0.062
31/03/1995 0.386 0.475
30/06/1995 1.238 -0.243
30/09/1995 1.836 0.263
31/12/1995 1.236 -0.125
31/03/1996 1.926 -0.228
30/06/1996 2.109 -0.013
30/09/1996 1.312 0.196
31/12/1996 0.972 -0.015
31/03/1997 1.028 -0.001
30/06/1997 1.086 -0.016
30/09/1997 2.822 0.156
31/12/1997 -0.818 -0.062
31/03/1998 1.418 0.408
30/06/1998 0.970 -0.548
30/09/1998 0.968 0.466
31/12/1998 2.826 -0.460
31/03/1999 0.599 0.228
30/06/1999 -0.651 -0.361
30/09/1999 1.289 0.579
31/12/1999 1.600 0.196
31/03/2000 2.324 0.535
30/06/2000 1.368 -0.499
30/09/2000 0.825 0.440
31/12/2000 0.378 -0.414
31/03/2001 0.868 0.478
30/06/2001 1.801 -0.521
30/09/2001 0.319 0.068
31/12/2001 0.877 0.045
31/03/2002 1.253 0.061
30/06/2002 1.247 -0.013
30/09/2002 1.513 0.625
31/12/2002 1.756 0.125
31/03/2003 1.443 -0.088
30/06/2003 0.874 -0.138
30/09/2003 1.524 0.122
31/12/2003 1.831 -0.075
31/03/2004 0.780 0.395
30/06/2004 1.665 -0.263
30/09/2004 0.390 0.543
31/12/2004 0.886 -0.348
31/03/2005 1.372 0.500
30/06/2005 2.574 -0.066
30/09/2005 0.961 0.058
31/12/2005 2.378 -0.061
31/03/2006 1.015 0.212
30/06/2006 1.008 -0.218
30/09/2006 1.105 0.593
31/12/2006 0.943 -0.144
31/03/2007 1.566 0.111
30/06/2007 1.003 -0.125
30/09/2007 1.810 0.268
31/12/2007 1.275 -0.592
31/03/2008 1.413 0.017
30/06/2008 -0.491 -0.891
30/09/2008 -0.617 -0.836
31/12/2008 -1.410 -1.092
31/03/2009 -1.593 0.182
30/06/2009 -0.106 -0.922
30/09/2009 0.788 0.351
31/12/2009 0.247 0.414
31/03/2010 1.221 -0.329
30/06/2010 1.561 -0.322
30/09/2010 0.163 0.376
31/12/2010 0.825 -0.104
31/03/2011 2.484 0.063
30/06/2011 -0.574 -0.107
30/09/2011 0.361 -0.006
31/12/2011 0.997 -0.304
31/03/2012 0.760 0.243
30/06/2012 0.143 -0.381
30/09/2012 2.547 0.315
31/12/2012 0.308 -0.046
31/03/2013 0.679 0.221
30/06/2013 0.766 -0.170
30/09/2013 1.843 0.352
31/12/2013 0.756 0.080
31/03/2014 1.380 -0.080
30/06/2014 1.501 0.162
30/09/2014 0.876 0.017
31/12/2014 0.055 -0.251
31/03/2015 0.497 0.442
30/06/2015 1.698 -0.278
30/09/2015 0.066 0.397
31/12/2015 0.470 0.076
31/03/2016 1.581 0.247
30/06/2016 0.859 -0.342
30/09/2016 0.865 -0.011
31/12/2016 1.467 0.049
31/03/2017 1.006 0.087
30/06/2017 0.437 -0.215
30/09/2017 0.527 0.098
31/12/2017 0.900 0.218
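In case it helps reproduce things, this is how the table above can be read in and turned into the series used in my code (a sketch; it assumes the table is saved as gdp.txt):
gdp <- read.table("gdp.txt", header = TRUE)
# quarterly series starting in 1988 Q1
ts.GDP <- ts(gdp$GDP_qoq, start = c(1988, 1), frequency = 4)
ts.CCI_AGG <- ts(gdp$CCI_A_qoq, start = c(1988, 1), frequency = 4)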
The only thing you need to understand is how to get predictions using lm(); it's not necessary to add the other details (without reproducible data you're only making it more difficult).
Create dummy data:
set.seed(123)
df <- data.frame(a = runif(10), b = runif(10), c = runif(10))
> print(df)
a b c
1 0.2875775 0.95683335 0.8895393
2 0.7883051 0.45333416 0.6928034
3 0.4089769 0.67757064 0.6405068
4 0.8830174 0.57263340 0.9942698
5 0.9404673 0.10292468 0.6557058
6 0.0455565 0.89982497 0.7085305
7 0.5281055 0.24608773 0.5440660
8 0.8924190 0.04205953 0.5941420
9 0.5514350 0.32792072 0.2891597
10 0.4566147 0.95450365 0.1471136
Fit your model:
model <- lm(c ~ a + b, data = df)
Create new data:
new_df <- data.frame(a = runif(1), b = runif(1))
> print(new_df)
a b
1 0.9630242 0.902299
Get predictions from your new data:
prediction <- predict(model, new_df)
> print(prediction)
1
0.8270997
In your case, the new data new_df will be your lagged data, but you have to make the appropriate changes, or provide reproducible data as above if you want us to go through the details of your problem.
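For instance, here is a minimal sketch of how your rolling loop could look with lm() (untested against your full data; it assumes ts.GDP is the series from your question):
y <- as.numeric(ts.GDP)
S <- round(0.75 * length(y))
h <- 1
error1.h <- c()
for (i in S:(length(y) - h)) {
  # estimate the AR(3) on observations 1..i via a lagged data frame
  dat <- data.frame(y0 = y[4:i],
                    y1 = y[3:(i - 1)],
                    y2 = y[2:(i - 2)],
                    y3 = y[1:(i - 3)])
  fit <- lm(y0 ~ y1 + y2 + y3, data = dat)
  # one step ahead: the three most recent observations become the lags
  new_dat <- data.frame(y1 = y[i], y2 = y[i - 1], y3 = y[i - 2])
  predict.h <- predict(fit, newdata = new_dat)
  error1.h <- c(error1.h, y[i + h] - predict.h)
}
The same pattern should extend to your ARDL model: add the lagged ts.CCI_AGG columns to both dat and new_dat.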
Hope this helps.
I output a series of acf results and want to extract just the lag-1 autocorrelation coefficient. Can anyone give a quick pointer? Thank you.
#A snippet of a series of acf() results
$`25`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 0.366 -0.347 -0.399 -0.074 0.230 0.050 -0.250 -0.213 -0.106 0.059 0.154 0.031
$`26`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.060 0.026 -0.163 -0.233 -0.191 -0.377 0.214 0.037 0.178 -0.016 0.049
$`27`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 -0.025 -0.136 0.569 -0.227 -0.264 0.218 -0.262 -0.411 0.123 -0.039 -0.192 0.130
# For this example, the extracted values will be 0.366, 0.060, -0.025; the values can be either in a list or a matrix.
EDIT
#`acf` in base R was used
p <- acf.each()
#sapply was tried but it resulted this
sapply(acf.each(), `[`, "1")
1 2 3
acf 0.7398 0.1746 0.4278
type "correlation" "correlation" "correlation"
n.used 24 17 14
lag 1 1 1
series "x" "x" "x"
snames NULL NULL NULL
The structure seems to be a list. We can use sapply to do the extraction; note that x$acf[1] is the lag-0 autocorrelation (always 1), so lag 1 is the second element:
sapply(lst1, function(x) x$acf[2])
Data:
lst1 <- list(acf(ldeaths), acf(ldeaths))
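Applied to the output of your own acf.each() (a sketch; acf.each() is the helper from your question), and converted to the matrix form you mentioned:
p <- acf.each()
lag1 <- sapply(p, function(x) x$acf[2])  # one lag-1 coefficient per series
as.matrix(lag1)                          # or keep lag1 as a plain vector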
I'm using Kaggle's Pokémon data to practice KNN imputation via preProcess(), but I encountered the following message after the predict() step. I am wondering whether I'm using an incorrect data format or whether some columns have an inappropriate class. Below is my code.
library(dplyr)
library(ggplot2)
library(tidyr)
library(reshape2)
library(caret)
library(skimr)
library(psych)
library(e1071)
library(data.table)
pokemon <- read.csv("https://www.dropbox.com/s/znbta9u9tub2ox9/pokemon.csv?dl=1")
pokemon = tbl_df(pokemon)
# select relevant features
df <- select(pokemon, hp, weight_kg, height_m, sp_attack, sp_defense, capture_rate)
pre_process_missing_data <- preProcess(df, method="knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = df)
and I received this error message:
Error: Must subset rows with a valid subscript vector.
x Subscript `nn$nn.idx` must be a simple vector, not a matrix.
Run `rlang::last_error()` to see where the error occurred.
The input for preProcess needs to be a base data.frame, not a tibble: tbl_df() converted your data, and the error you see comes from tibble subsetting rules rejecting the matrix subscript that the KNN imputation uses internally. This works:
pre_process_missing_data <- preProcess(as.data.frame(df), method="knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = df)
classify_legendary
> classify_legendary
# A tibble: 801 x 6
hp weight_kg height_m sp_attack sp_defense capture_rate
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -0.902 -0.498 -0.429 -0.195 -0.212 45
2 -0.337 -0.442 -0.152 0.269 0.325 45
3 0.415 0.353 0.774 1.57 1.76 45
4 -1.13 -0.484 -0.522 -0.349 -0.748 45
5 -0.412 -0.388 -0.0591 0.269 -0.212 45
6 0.340 0.266 0.496 2.71 1.58 45
7 -0.939 -0.479 -0.615 -0.659 -0.247 45
8 -0.375 -0.356 -0.152 -0.195 0.325 45
9 0.378 0.221 0.404 1.97 1.58 45
10 -0.902 -0.535 -0.800 -1.59 -1.82 255
# ... with 791 more rows
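Note from the printed tibble that capture_rate came through as <chr>: preProcess only transforms numeric columns, so that column was neither imputed nor standardised. If you want it treated like the others, convert it to numeric first (a sketch; any non-numeric entries in the Kaggle file become NA, which the imputation then fills):
df$capture_rate <- as.numeric(as.character(df$capture_rate))  # non-numeric -> NA, with a warning
pre_process_missing_data <- preProcess(as.data.frame(df), method = "knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = as.data.frame(df))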
I am trying to identify runs of sequential recordings in a time series and aggregate the data within each run.
Example Data
Here is an example of the data taken at a maximum frequency of 1 second:
timestamp Value
06:07:23 0.439
06:07:24 0.556
06:07:25 0.430
06:07:26 0.418
06:07:27 0.407
06:07:47 0.439
06:07:48 0.420
06:07:49 0.405
09:55:21 0.507
09:55:22 0.439
10:03:24 0.439
10:03:25 0.439
10:03:36 1.708
10:03:37 0.608
10:03:38 0.439
10:03:46 0.484
10:03:47 0.380
10:03:48 0.607
10:03:49 0.439
10:03:50 0.439
10:03:51 0.439
10:03:52 0.430
10:03:53 0.439
10:03:54 4.924
10:03:55 1.012
10:03:56 0.887
10:03:57 0.439
10:03:58 0.439
10:04:18 0.447
10:04:19 0.447
As can be seen, there are periods during which a value is recorded every second. I am trying to aggregate consecutive observations with no gap between them, to end up with something like the following:
timestamp max duration
06:07:23 0.556 5
06:07:47 0.439 3
09:55:21 0.507 2
10:03:24 0.439 2
10:03:36 1.708 3
10:03:46 1.012 13
10:04:18 0.447 2
I am struggling to find a way of grouping the data by these sequential runs. The closest answer I have been able to find is this one; however, the answers were provided over three and a half years ago, and I was struggling to get the data.table method working.
Any ideas much appreciated!
Here is an attempt in data.table:
dat[,
  .(timestamp = timestamp[1], max = max(Value), duration = .N),
  # a new group starts whenever the gap to the previous reading exceeds one second
  by = cumsum(c(FALSE, diff(as.POSIXct(dat$timestamp, format="%H:%M:%S", tz="UTC")) > 1))
]
# cumsum timestamp max duration
#1: 0 06:07:23 0.556 5
#2: 1 06:07:47 0.439 3
#3: 2 09:55:21 0.507 2
#4: 3 10:03:24 0.439 2
#5: 4 10:03:36 1.708 3
#6: 5 10:03:46 4.924 13
#7: 6 10:04:18 0.447 2
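If the data.table syntax is the sticking point, the same grouping idea translates to dplyr (a sketch, assuming dat has the character timestamp and numeric Value columns shown above):
library(dplyr)
dat %>%
  mutate(t = as.POSIXct(timestamp, format = "%H:%M:%S", tz = "UTC"),
         gap = as.numeric(difftime(t, lag(t, default = first(t)), units = "secs")),
         run = cumsum(gap > 1)) %>%          # new run whenever the gap exceeds 1 second
  group_by(run) %>%
  summarise(timestamp = first(timestamp), max = max(Value), duration = n())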
I want to regress a differenced dependent variable on differenced independent variables and on one non-differenced variable.
I tried the following lines in R:
xt <- ts(xx)
yt <- ts(yy)
zt <- ts(zz)
bt <- ts(bb)
mt <- ts(mm)
xtd <- diff(xt)
ytd <- diff(yt)
ztd <- diff(zt)
btd <- diff(bt)
axx <- ts.intersect(xtd, ytd, ztd, btd, mt)
reg1 <- lm(xtd~ytd+ztd+btd+mt, axx)
summary(reg1)
Without ts.intersect(), an error message pops up saying that the variable lengths differ, found for variable mt. That makes sense, since it isn't differenced. My questions are:
i) Is this a correct way to deal with the differing variable lengths? ii) Is there a more efficient way? Many thanks in advance.
Date xx yy zz bb mm
1 03.01.2005 0.065 0.001 14.4700 17.938 345001.0
2 04.01.2005 0.067 0.006 14.5100 17.886 345001.0
3 05.01.2005 0.064 -0.007 14.4200 17.950 334001.0
4 06.01.2005 0.065 -0.005 13.8000 17.950 334001.0
5 07.01.2005 0.060 -0.006 13.5700 17.913 334001.0
6 10.01.2005 0.059 -0.007 12.9200 17.958 334001.0
7 11.01.2005 0.057 -0.009 13.6800 17.962 334001.0
8 12.01.2005 0.060 -0.005 14.0500 17.886 340001.0
9 13.01.2005 0.060 -0.004 13.6400 17.568 340001.0
10 14.01.2005 0.059 -0.005 13.5700 17.471 340001.0
11 17.01.2005 0.058 -0.005 13.2000 17.365 340001.0
12 18.01.2005 0.059 -0.005 13.1700 17.214 340001.0
13 19.01.2005 0.057 -0.006 13.6300 17.143 354501.0
14 20.01.2005 0.057 -0.007 14.1700 17.125 354501.0
15 21.01.2005 0.056 -0.007 13.9600 17.193 354501.0
16 24.01.2005 0.057 -0.006 14.1100 17.283 354501.0
17 25.01.2005 0.058 -0.006 13.6300 17.083 354501.0
18 26.01.2005 0.057 -0.006 13.3200 17.348 348001.0
19 27.01.2005 0.059 -0.005 12.4600 17.295 353001.0
20 28.01.2005 0.060 -0.004 12.8100 17.219 353001.0
21 31.01.2005 0.058 -0.004 12.7200 17.143 353001.0
22 01.02.2005 0.059 -0.003 12.3600 17.125 353001.0
23 02.02.2005 0.058 -0.003 12.2500 17.000 357501.0
24 03.02.2005 0.056 -0.008 12.3800 16.808 357501.0
25 04.02.2005 0.058 -0.004 11.6000 16.817 357501.0
26 07.02.2005 0.058 -0.004 11.9900 16.798 357501.0
27 08.02.2005 0.058 -0.003 11.9200 16.804 355501.0
28 09.02.2005 0.062 0.000 12.1900 16.589 355501.0
29 10.02.2005 0.060 0.000 12.0400 16.500 355501.0
30 11.02.2005 0.062 0.002 11.9900 16.429 355501.0
The short answer is yes, you need to use ts.intersect() when some of your variables are differenced and some are not.
You can probably clean up the code a little so you don't have so many repeated lines, but (especially if these are all your variables) it doesn't really make a difference.
For example, you might convert all columns to time series in one step with ts.d <- ts(d[2:6]), as in the sketch below.
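A sketch of that condensed version (assuming your data frame is d, with columns Date, xx, yy, zz, bb, mm as printed above):
ts.d <- ts(d[2:6])                       # all five series in one step
axx <- ts.intersect(xtd = diff(ts.d[, "xx"]),
                    ytd = diff(ts.d[, "yy"]),
                    ztd = diff(ts.d[, "zz"]),
                    btd = diff(ts.d[, "bb"]),
                    mt  = ts.d[, "mm"])  # aligns the differenced and level series
reg1 <- lm(xtd ~ ytd + ztd + btd + mt, axx)
summary(reg1)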
My data set includes 17 stations, and for each station there are 24 hourly temperature values.
I would like to map each station's value in each hour, for all hours.
What I want to do is something like the image.
The data is in the following format:
N2 N3 N4 N5 N7 N8 N10 N12 N13 N14 N17 N19 N25 N28 N29 N31 N32
1 1.300 -0.170 -0.344 2.138 0.684 0.656 0.882 0.684 1.822 1.214 2.046 2.432 0.208 0.312 0.530 0.358 0.264
2 0.888 -0.534 -0.684 1.442 -0.178 -0.060 0.430 -0.148 1.420 0.286 1.444 2.138 -0.264 -0.042 0.398 -0.196 -0.148
3 0.792 -0.564 -0.622 0.998 -0.320 1.858 -0.036 -0.118 1.476 0.110 0.964 2.048 -0.480 -0.434 0.040 -0.538 -0.322
4 0.324 -1.022 -1.128 1.380 -0.792 1.042 -0.054 -0.158 1.518 -0.102 1.354 2.386 -0.708 -0.510 0.258 -0.696 -0.566
5 0.650 -0.774 -0.982 1.124 -0.540 3.200 -0.052 -0.258 1.452 0.028 1.022 2.110 -0.714 -0.646 0.266 -0.768 -0.532
6 0.670 -0.660 -0.844 1.248 -0.550 2.868 -0.098 -0.240 1.380 -0.012 1.164 2.324 -0.498 -0.474 0.860 -0.588 -0.324
MeteoSwiss
1 -0.6
2 -1.2
3 -1.0
4 -0.8
5 -0.4
6 -0.2
where N2, N3, ..., MeteoSwiss are the stations and each row gives each station's temperature for one hour. The station coordinates are:
id Longitude Latitude
2 7.1735 45.86880001
3 7.17254 45.86887001
4 7.171636 45.86923601
5 7.18018 45.87158001
7 7.177229 45.86923001
8 7.17524 45.86808001
10 7.179299 45.87020001
12 7.175189 45.86974001
13 7.179379 45.87081001
14 7.175509 45.86932001
17 7.18099 45.87262001
19 7.18122 45.87355001
25 7.15497 45.87058001
28 7.153399 45.86954001
29 7.152649 45.86992001
31 7.154419 45.87004001
32 7.156099 45.86983001
MeteoSwiss 7.184 45.896
I define a toy example more or less resembling your data:
vals <- matrix(rnorm(24*17), nrow=24)
cds <- data.frame(id=paste0('N', 1:17),
                  Longitude=rnorm(n=17, mean=7.1),
                  Latitude=rnorm(n=17, mean=45.8))
vals <- as.data.frame(t(vals))
names(vals) <- paste0('H', 1:24)
The sp package defines several classes and methods to store and
display spatial data. For your example you should use the
SpatialPointsDataFrame class:
library(sp)
mySP <- SpatialPointsDataFrame(coords=cds[,-1], data=data.frame(vals))
and the spplot method to display the information:
spplot(mySP, as.table=TRUE,
       col.regions=bpy.colors(10),
       alpha=0.8, edge.col='black')
Besides, you may find the spacetime package useful (paper at JSS).