tsCV h-step-ahead when h>1

tsCV h-step-ahead when h>1 - r

schematic 4-step-ahead forecasts
According to the illustration above, I'd expect the time period of the first CV error (first non-NA values) in the column h=4 to be period 10, right? The result below shows that all first CV errors start at period 7. Why is this so?
> data <- ts(rnorm(n = 50, mean = 10, sd = 5))
> tsCV(data, forecastfunction = splinef, h = 4, initial = 6) %>% head(12)
Time Series:
Start = 1
End = 12
Frequency = 1
h=1 h=2 h=3 h=4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 -0.6898367 1.94707898 -0.4241705 2.6114473
8 2.2835535 -0.03213156 3.0590506 2.9266469
9 -1.0397064 1.90081726 1.6177550 4.7870414
10 2.3104741 2.08295460 5.3077838 5.1881762
11 1.2481952 4.36896765 4.1453033 3.9093216
12 3.9553215 3.68404796 3.4004571 0.4572387
Image source: https://otexts.com/fpp2/accuracy.html, Rob J Hyndman

From the help file for forecast::tsCV:
Value
Numerical time series object containing the forecast errors as a vector (if h=1) and a matrix otherwise. The time index corresponds to the last period of the training data. The columns correspond to the forecast horizons.
So the cell at time=7 and h=4 gives forecasts for time 11.

Related

data length cannot be over width of moving average

I use quantmod, to calculate the moving average over 2000 dataframes with loop
price = xts object
price <- cbind(price, SMA(price, 5), SMA(price, 10),
SMA(price, 20), SMA(price, 60), SMA(price, 120),
SMA(price, 180), SMA(price, 240))
But some data don't exceed the number of width, stop running in the middle. In that case, I just want to fill NA only.
I need some support to solve this problem.
Or if I need to use any other package for solving this problem, let me know
Thanks

Moving average functions give an error when the chosen period is longer than the available data. As #RuiBarradas mentions in the comment, for a SMA zoo::rollmean could work. As you need to loop over quite a few data.frames a function is easier. The function below could be used in an lapply function or just in a loop.
I created a sub function inside the bigger function to check if the chosen period is bigger than the rows supplied. If so, return a vector of NA's else return a SMA. After that, loop over the periods to return a data.frame with the supplied price column and all the SMA columns with a name so you can see which SMA is in which column.
Note that there is no error handling in case of incorrect inputs. Sample data below.
# periods for the SMA
periods <- c(5, 10, 20, 60, 120, 180, 240)
get_smas <- function(price, n) {
my_sma <- function(x, n = 10) {
if (n < 1 || n > NROW(x)) {
out <- rep(NA_real_, NROW(x))
} else {
# change SMA for EMA if you want the EMA's
out <- TTR::SMA(x, n = n)
}
out
}
# combine the price column with the ma's. Reduce works backwards, so price column last
price_combined <- Reduce(cbind, lapply(n, function(x) my_sma(price, n = x)), price)
# turn matrix into data.frame
price_combined <- data.frame(price_combined)
# rename columns, assuming price column has a column name.
# change paste0 value from SMA to EMA if EMA is used.
names(price_combined) <- c(names(price_combined)[1], paste0("SMA_", n))
price_combined
}
# supply a price and a vector of periods
my_prices <- get_smas(price, periods)
head(my_prices, 2)
Close SMA_5 SMA_10 SMA_20 SMA_60 SMA_120 SMA_180 SMA_240
1 182.01 NA NA NA NA NA NA NA
2 179.70 NA NA NA NA NA NA NA
tail(my_prices, 2)
Close SMA_5 SMA_10 SMA_20 SMA_60 SMA_120 SMA_180 SMA_240
142 156.79 154.156 152.053 147.475 145.4393 156.1770 NA NA
143 157.35 154.556 152.941 148.381 145.4292 156.0474 NA NA
data:
# close prices of aapl from 2022-01-03 to 2022-07-28
price <- structure(list(Close = c(182.009995, 179.699997, 174.919998,
172, 172.169998, 172.190002, 175.080002, 175.529999, 172.190002,
173.070007, 169.800003, 166.229996, 164.509995, 162.410004, 161.619995,
159.779999, 159.690002, 159.220001, 170.330002, 174.779999, 174.610001,
175.839996, 172.899994, 172.389999, 171.660004, 174.830002, 176.279999,
172.119995, 168.639999, 168.880005, 172.789993, 172.550003, 168.880005,
167.300003, 164.320007, 160.070007, 162.740005, 164.850006, 165.119995,
163.199997, 166.559998, 166.229996, 163.169998, 159.300003, 157.440002,
162.949997, 158.520004, 154.729996, 150.619995, 155.089996, 159.589996,
160.619995, 163.979996, 165.380005, 168.820007, 170.210007, 174.070007,
174.720001, 175.600006, 178.960007, 177.770004, 174.610001, 174.309998,
178.440002, 175.059998, 171.830002, 172.139999, 170.089996, 165.75,
167.660004, 170.399994, 165.289993, 165.070007, 167.399994, 167.229996,
166.419998, 161.789993, 162.880005, 156.800003, 156.570007, 163.639999,
157.649994, 157.960007, 159.479996, 166.020004, 156.770004, 157.279999,
152.059998, 154.509995, 146.5, 142.559998, 147.110001, 145.539993,
149.240005, 140.820007, 137.350006, 137.589996, 143.110001, 140.360001,
140.520004, 143.779999, 149.639999, 148.839996, 148.710007, 151.210007,
145.380005, 146.139999, 148.710007, 147.960007, 142.639999, 137.130005,
131.880005, 132.759995, 135.429993, 130.059998, 131.559998, 135.869995,
135.350006, 138.270004, 141.660004, 141.660004, 137.440002, 139.229996,
136.720001, 138.929993, 141.559998, 142.919998, 146.350006, 147.039993,
144.869995, 145.860001, 145.490005, 148.470001, 150.169998, 147.070007,
151, 153.039993, 155.350006, 154.089996, 152.949997, 151.600006,
156.789993, 157.350006)), class = "data.frame", row.names = c(NA,
-143L))

rollmeanr and rollapplyr can handle the situation with fewer data items than width.
library(zoo)
price <- 1:6
rollmeanr(price, 10, fill = NA)
## [1] NA NA NA NA NA NA
w <- c(5, 10, 20, 60, 120, 180, 240)
sapply(setNames(w, w), rollmeanr, x = price, fill = NA)
## 5 10 20 60 120 180 240
## [1,] NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA
## [5,] 3 NA NA NA NA NA NA
## [6,] 4 NA NA NA NA NA NA

Create column from data on dynamic number of columns depending on availabity in R

Given a uncertain number of columns containing source values for the same variable I would like to create a column that defines the final value to be selected depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
!is.na(df$source_1),
df$source_1,
ifelse(
!is.na(df$source_2),
df$source_2,
df$source_n
)
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to set the data to be considered? Given in some cases n_sources could be 1,2,3..,7 and priority follows the natural order (1 > 2 >..)

Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])

coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092

Fill data frame by column with for loop

I created an empty data frame with 11 columns and 15 rows and subsequently named the columns.
L_df <- data.frame(matrix(ncol = 11, nrow = 15))
names(L_df) <- paste0("L_por", 0:10)
w <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)
The data frame looks like this:
head(L_df)
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
Now, I would like to fill the data frame by column, based on a formula. I tried to express this with a nested for loop:
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
col_tmp[j] <- L_por_tmp
}
}
For each column, I iterate over a predefined vector pm of length 11. For each row, I iterate over a predefined vector w of length 15 (repeats each column).
Example: First, select pm[1] for the first column. Second, select w[i] for each row in the first column. Store the formula in L_por_tmp and use it to fill the first column from row1 to row15. The whole procedure should start all over again for the second column (with pm[2]) with w[i] for each row and so on. wu and L are fixed in the formula.
R executes the code without an error. When I check the tmp values, they are correct. However, the data frame remains empty. L_df does not get filled. I would like solve this with a loop but if you have other solutions, I am happy to hear them! I get the impression there might be a smoother way of doing this. Cheers!

Solution
L_df <- data.frame(sapply(pm, function(x) x * L * ((w - wu) / 100)))
names(L_df) <- c("L_por0", "L_por1", "L_por2", "L_por3", "L_por4", "L_por5",
"L_por6", "L_por7", "L_por8", "L_por9", "L_por10")
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320
4 6940.96 6871.550 6802.141 6732.731 6663.322 6593.912 6524.502 6455.093
5 8676.20 8589.438 8502.676 8415.914 8329.152 8242.390 8155.628 8068.866
6 10411.44 10307.326 10203.211 10099.097 9994.982 9890.868 9786.754 9682.639
7 12146.68 12025.213 11903.746 11782.280 11660.813 11539.346 11417.879 11296.412
8 13881.92 13743.101 13604.282 13465.462 13326.643 13187.824 13049.005 12910.186
9 15617.16 15460.988 15304.817 15148.645 14992.474 14836.302 14680.130 14523.959
10 17352.40 17178.876 17005.352 16831.828 16658.304 16484.780 16311.256 16137.732
11 19087.64 18896.764 18705.887 18515.011 18324.134 18133.258 17942.382 17751.505
12 20822.88 20614.651 20406.422 20198.194 19989.965 19781.736 19573.507 19365.278
13 22558.12 22332.539 22106.958 21881.376 21655.795 21430.214 21204.633 20979.052
14 24293.36 24050.426 23807.493 23564.559 23321.626 23078.692 22835.758 22592.825
15 26028.60 25768.314 25508.028 25247.742 24987.456 24727.170 24466.884 24206.598
L_por8 L_por9 L_por10
1 1596.421 1579.068 1561.716
2 3192.842 3158.137 3123.432
3 4789.262 4737.205 4685.148
4 6385.683 6316.274 6246.864
5 7982.104 7895.342 7808.580
6 9578.525 9474.410 9370.296
7 11174.946 11053.479 10932.012
8 12771.366 12632.547 12493.728
9 14367.787 14211.616 14055.444
10 15964.208 15790.684 15617.160
11 17560.629 17369.752 17178.876
12 19157.050 18948.821 18740.592
13 20753.470 20527.889 20302.308
14 22349.891 22106.958 21864.024
15 23946.312 23686.026 23425.740
Explanation
The sapply() function can be used to iterate over vectors in a more idiomatic way for R programming. We iterate over pm and use your formula once since R is vectorised; each time it creates a vector of length 15 (so 11 vectors of length 15), and when we wrap it in data.frame() returns the data frame you want and we add in the column names.
NOTE: Applying functions to every element of a vector using an apply() family function has some different implications than iterating using for loops. In your case, I think sapply() is easier and more understandable. For more information on when you need a loop or when something like apply is better, see for example this discussion from Hadley Wickham's Advanced R book.

You are just doing a small mistake and you were almost there, Edited your function:
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
L_df[ j ,col_tmp] <- L_por_tmp ##You must have used df[i, j] referencing here
}
}
Output:
Just printing the head of few rows:
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773 1596.421 1579.068 1561.716
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546 3192.842 3158.137 3123.432
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320 4789.262 4737.205 4685.148

combine two zoo time series

I have 2 sucesive ZOO time series (the date of one begins after the other finishes), they have the following form (but much longer and not only NA values):
a:
1979-01-01 1979-01-02 1979-01-03 1979-01-04 1979-01-05 1979-01-06 1979-01-07 1979-01-08 1979-01-09
NA NA NA NA NA NA NA NA NA
b:
1988-08-15 1988-08-16 1988-08-17 1988-08-18 1988-08-19 1988-08-20 1988-08-21 1988-08-22 1988-08-23 1988-08-24 1988-08-25
NA NA NA NA NA NA NA NA NA NA NA
all I want to do is combine them in one time serie as a ZOO object, it seems to be a basic task but I am doing something wrong. I use the function "merge":
combined <- merge(a, b)
but the result is something in the form:
a b
1980-03-10 NA NA
1980-03-11 NA NA
1980-03-12 NA NA
1980-03-13 NA NA
1980-03-14 NA NA
1980-03-15 NA NA
1980-03-16 NA NA
.
.
which is not a time series, and the lengths dont fit:
> length(a)
[1] 10957
> length(b)
[1] 2557
> length(combined)
[1] 27028
how can I just combine them into one time series with the form of the original ones?

Assuming the series shown reproducibly in the Note at the end, the result of merging the two series has 20 times and 2 columns (one for each series). The individual series have lengths 9 and 11 elements and the merged series is a zoo object with 9 + 11 = 20 rows (since there are no intersecting times) and 2 columns (one for each input) and length 40 (= 20 * 2). Note that the length of a multivariate series is the number of elements in it, not the number of time points.
length(z1)
## [1] 9
length(z2)
## [1] 11
m <- merge(z1, z2)
class(m)
## [1] "zoo"
dim(m)
## [1] 20 2
nrow(m)
## [1] 20
length(index(m))
## [1] 20
length(m)
## [1] 40
If what you wanted is to string them out one after another then use c:
length(c(z1, z2))
## [1] 20
The above are consistent with how merge, c and length work in base R.
Note:
library(zoo)
z1 <- zoo(rep(NA, 9), as.Date(c("1979-01-01", "1979-01-02", "1979-01-03",
"1979-01-04", "1979-01-05", "1979-01-06", "1979-01-07", "1979-01-08",
"1979-01-09")))
z2 <- zoo(rep(NA, 11), as.Date(c("1988-08-15", "1988-08-16", "1988-08-17",
"1988-08-18", "1988-08-19", "1988-08-20", "1988-08-21", "1988-08-22",
"1988-08-23", "1988-08-24", "1988-08-25")))

Finding the maximum values associated to a given string length in a dataframe

I'm exploring the acss package.
I want to know which strings for a given length of the acss_data dataframe have been assigned maximum K.i value.
tail(acss_data)
K.2 K.4 K.5 K.6 K.9
012345678883 NA NA NA NA 50.28906
012345678884 NA NA NA NA 50.31291
012345678885 NA NA NA NA 49.71200
012345678886 NA NA NA NA 49.81041
012345678887 NA NA NA NA 49.51936
012345678888 NA NA NA NA 48.61247
The acss_data dataframe contains K.2, K.4, K.5, K.6, and K.9 values associated to strings from lengths 1 to 12, and I want to know the maximum K.i for each string length, i.e, I want to know the max K.2 for strings of length 1, length 2, ... length 12. Then I would like to know the max K.4 for strings of length 1, length 2, ... length 12, etc.
How can I query this in R?

You can use aggregate to summarize the data:
library(acss.data)
d=acss_data
d$len=nchar(rownames(d)) # calculate lengths of strings
d[is.na(d)]=-1 # fix NAs for max function
s=aggregate(d[,1:5], list(d$len), max)
The result is a data frame:
Group.1 K.2 K.4 K.5 K.6 K.9
1 1 2.514277 3.547388 3.947032 4.268200 4.964344
2 2 3.327439 5.414104 6.108780 6.675197 7.927055
3 3 5.505383 8.520908 9.432003 10.189697 11.905392
4 4 8.406714 12.231447 13.284113 14.182866 16.280365
5 5 11.834019 16.230760 17.340010 18.329451 20.735158
6 6 15.366332 19.993828 21.291613 22.410022 25.170522
7 7 18.989162 23.816377 25.389206 26.615356 29.685526
8 8 22.679752 27.556472 29.379371 30.880603 34.243156
9 9 26.343527 31.187297 33.264487 35.097073 38.851463
10 10 29.427574 34.891807 37.282071 39.258235 43.506412
11 11 32.778797 39.506517 42.000889 43.657406 48.208571
12 12 37.064199 40.506517 42.263923 43.657406 52.897870

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

tsCV h-step-ahead when h>1 - r

Related

data length cannot be over width of moving average

Create column from data on dynamic number of columns depending on availabity in R

Fill data frame by column with for loop

combine two zoo time series

Finding the maximum values associated to a given string length in a dataframe

Categories

Resources