combine two zoo time series - r

I have two successive zoo time series (the dates of one begin after the other ends). They have the following form (but are much longer and do not contain only NA values):
a:
1979-01-01 1979-01-02 1979-01-03 1979-01-04 1979-01-05 1979-01-06 1979-01-07 1979-01-08 1979-01-09
NA NA NA NA NA NA NA NA NA
b:
1988-08-15 1988-08-16 1988-08-17 1988-08-18 1988-08-19 1988-08-20 1988-08-21 1988-08-22 1988-08-23 1988-08-24 1988-08-25
NA NA NA NA NA NA NA NA NA NA NA
All I want to do is combine them into one time series as a zoo object. It seems like a basic task, but I am doing something wrong. I use the function "merge":
combined <- merge(a, b)
but the result is something in the form:
a b
1980-03-10 NA NA
1980-03-11 NA NA
1980-03-12 NA NA
1980-03-13 NA NA
1980-03-14 NA NA
1980-03-15 NA NA
1980-03-16 NA NA
.
.
which is not a single time series, and the lengths don't fit:
> length(a)
[1] 10957
> length(b)
[1] 2557
> length(combined)
[1] 27028
How can I just combine them into one time series with the form of the original ones?

Assuming the series shown reproducibly in the Note at the end, the result of merging the two series has 20 times and 2 columns (one for each series). The individual series have lengths 9 and 11 elements and the merged series is a zoo object with 9 + 11 = 20 rows (since there are no intersecting times) and 2 columns (one for each input) and length 40 (= 20 * 2). Note that the length of a multivariate series is the number of elements in it, not the number of time points.
length(z1)
## [1] 9
length(z2)
## [1] 11
m <- merge(z1, z2)
class(m)
## [1] "zoo"
dim(m)
## [1] 20 2
nrow(m)
## [1] 20
length(index(m))
## [1] 20
length(m)
## [1] 40
If what you wanted is to string them out one after another then use c:
length(c(z1, z2))
## [1] 20
The above are consistent with how merge, c and length work in base R.
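The same bookkeeping can be checked with plain vectors and matrices, without zoo. This is only a base-R analogy, assuming two short made-up vectors x and y (they are not the series from the question): c() concatenates, while a two-column layout padded with NA, like the one merge produces for non-overlapping times, counts every cell in length().

```r
# Made-up stand-ins for the two series (hypothetical values)
x <- c(1, 2, 3)
y <- c(4, 5)

# Stringing them out one after another: one value per time point
stacked <- c(x, y)
n_stacked <- length(stacked)

# Side-by-side layout: one row per time point, one column per series,
# padded with NA where a series has no observation
side_by_side <- cbind(a = c(x, NA, NA), b = c(NA, NA, NA, y))
n_rows  <- nrow(side_by_side)
n_cells <- length(side_by_side)   # rows * columns, not the number of rows
```

Here n_stacked and n_rows agree, while n_cells is twice as large, which mirrors the difference between length(c(z1, z2)) and length(merge(z1, z2)).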
Note:
library(zoo)
z1 <- zoo(rep(NA, 9), as.Date(c("1979-01-01", "1979-01-02", "1979-01-03",
"1979-01-04", "1979-01-05", "1979-01-06", "1979-01-07", "1979-01-08",
"1979-01-09")))
z2 <- zoo(rep(NA, 11), as.Date(c("1988-08-15", "1988-08-16", "1988-08-17",
"1988-08-18", "1988-08-19", "1988-08-20", "1988-08-21", "1988-08-22",
"1988-08-23", "1988-08-24", "1988-08-25")))

tsCV h-step-ahead when h>1

schematic 4-step-ahead forecasts
According to the illustration above, I'd expect the time period of the first CV error (first non-NA values) in the column h=4 to be period 10, right? The result below shows that all first CV errors start at period 7. Why is this so?
> data <- ts(rnorm(n = 50, mean = 10, sd = 5))
> tsCV(data, forecastfunction = splinef, h = 4, initial = 6) %>% head(12)
Time Series:
Start = 1
End = 12
Frequency = 1
h=1 h=2 h=3 h=4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 -0.6898367 1.94707898 -0.4241705 2.6114473
8 2.2835535 -0.03213156 3.0590506 2.9266469
9 -1.0397064 1.90081726 1.6177550 4.7870414
10 2.3104741 2.08295460 5.3077838 5.1881762
11 1.2481952 4.36896765 4.1453033 3.9093216
12 3.9553215 3.68404796 3.4004571 0.4572387
Image source: https://otexts.com/fpp2/accuracy.html, Rob J Hyndman
From the help file for forecast::tsCV:
Value
Numerical time series object containing the forecast errors as a vector (if h=1) and a matrix otherwise. The time index corresponds to the last period of the training data. The columns correspond to the forecast horizons.
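So the cell at time=7 and h=4 gives the forecast error for time 11: the training data ends at period 7, and the 4-step-ahead forecast made from it targets period 7 + 4 = 11.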
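The indexing convention can be reproduced by hand. The following is a minimal base-R sketch, not the forecast package's implementation: it assumes a naive forecast (repeat the last training value) in place of splinef, uses made-up data, and starts the loop at initial + 1 only because that matches the output shown in the question. Each row is labelled by the last period of the training data, so the error stored in row i, column h belongs to period i + h.

```r
set.seed(42)
y <- rnorm(20, mean = 10, sd = 5)   # made-up data
h <- 4
initial <- 6
n <- length(y)

# One row per training end point, one column per forecast horizon
errors <- matrix(NA_real_, nrow = n, ncol = h,
                 dimnames = list(NULL, paste0("h=", seq_len(h))))

for (i in seq(initial + 1, n - 1)) {
  fc <- rep(y[i], h)              # naive forecast (assumption, stands in for splinef)
  target <- i + seq_len(h)        # periods being forecast from origin i
  ok <- target <= n               # drop targets beyond the end of the data
  errors[i, ok] <- y[target[ok]] - fc[ok]
}

# First row with a non-NA 4-step-ahead error
first_row <- which(!is.na(errors[, "h=4"]))[1]
```

With these settings first_row is 7, and errors[7, "h=4"] is the error for period 11, even though it is stored in row 7.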

Fill data frame by column with for loop

I created an empty data frame with 11 columns and 15 rows and subsequently named the columns.
L_df <- data.frame(matrix(ncol = 11, nrow = 15))
names(L_df) <- paste0("L_por", 0:10)
w <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)
The data frame looks like this:
head(L_df)
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
Now, I would like to fill the data frame by column, based on a formula. I tried to express this with a nested for loop:
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
col_tmp[j] <- L_por_tmp
}
}
For each column, I iterate over a predefined vector pm of length 11. For each row, I iterate over a predefined vector w of length 15 (repeats each column).
Example: First, select pm[1] for the first column. Second, select w[j] for each row j in the first column. Compute the formula, store the result in L_por_tmp, and use it to fill the first column from row 1 to row 15. The whole procedure should start over for the second column (with pm[2]), again with w[j] for each row, and so on. wu and L are fixed in the formula.
R executes the code without an error. When I check the tmp values, they are correct. However, the data frame remains empty: L_df does not get filled. I would like to solve this with a loop, but if you have other solutions, I am happy to hear them! I get the impression there might be a smoother way of doing this. Cheers!
Solution
L_df <- data.frame(sapply(pm, function(x) x * L * ((w - wu) / 100)))
names(L_df) <- c("L_por0", "L_por1", "L_por2", "L_por3", "L_por4", "L_por5",
"L_por6", "L_por7", "L_por8", "L_por9", "L_por10")
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320
4 6940.96 6871.550 6802.141 6732.731 6663.322 6593.912 6524.502 6455.093
5 8676.20 8589.438 8502.676 8415.914 8329.152 8242.390 8155.628 8068.866
6 10411.44 10307.326 10203.211 10099.097 9994.982 9890.868 9786.754 9682.639
7 12146.68 12025.213 11903.746 11782.280 11660.813 11539.346 11417.879 11296.412
8 13881.92 13743.101 13604.282 13465.462 13326.643 13187.824 13049.005 12910.186
9 15617.16 15460.988 15304.817 15148.645 14992.474 14836.302 14680.130 14523.959
10 17352.40 17178.876 17005.352 16831.828 16658.304 16484.780 16311.256 16137.732
11 19087.64 18896.764 18705.887 18515.011 18324.134 18133.258 17942.382 17751.505
12 20822.88 20614.651 20406.422 20198.194 19989.965 19781.736 19573.507 19365.278
13 22558.12 22332.539 22106.958 21881.376 21655.795 21430.214 21204.633 20979.052
14 24293.36 24050.426 23807.493 23564.559 23321.626 23078.692 22835.758 22592.825
15 26028.60 25768.314 25508.028 25247.742 24987.456 24727.170 24466.884 24206.598
L_por8 L_por9 L_por10
1 1596.421 1579.068 1561.716
2 3192.842 3158.137 3123.432
3 4789.262 4737.205 4685.148
4 6385.683 6316.274 6246.864
5 7982.104 7895.342 7808.580
6 9578.525 9474.410 9370.296
7 11174.946 11053.479 10932.012
8 12771.366 12632.547 12493.728
9 14367.787 14211.616 14055.444
10 15964.208 15790.684 15617.160
11 17560.629 17369.752 17178.876
12 19157.050 18948.821 18740.592
13 20753.470 20527.889 20302.308
14 22349.891 22106.958 21864.024
15 23946.312 23686.026 23425.740
Explanation
The sapply() function can be used to iterate over vectors in a way that is more idiomatic for R. We iterate over pm and apply your formula once per element; because R is vectorised over w, each call returns a vector of length 15 (so 11 vectors of length 15). Wrapping the result in data.frame() gives the data frame you want, and then we add the column names.
NOTE: Applying functions to every element of a vector using an apply() family function has some different implications than iterating using for loops. In your case, I think sapply() is easier and more understandable. For more information on when you need a loop or when something like apply is better, see for example this discussion from Hadley Wickham's Advanced R book.
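Since the formula is just a product of a row term and a column term, it can also be written without any explicit iteration. This is a sketch of an alternative using base R's outer(), which was not part of the original answers; the inputs are copied from the question.

```r
w  <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L  <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)

# outer() builds the 15 x 11 matrix with entries (w[j] - wu) * pm[i];
# scaling by L / 100 completes pm * L * ((w - wu) / 100)
vals <- outer(w - wu, pm) * L / 100

L_df <- as.data.frame(vals)
names(L_df) <- paste0("L_por", 0:10)
```

Rows follow w and columns follow pm, so the layout matches the data frame produced by the loop and by sapply().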
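You made just one small mistake and were almost there. In your loop, col_tmp holds only the column name (a character string), so col_tmp[j] <- L_por_tmp changes that string vector and never touches L_df. Here is your loop, edited: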
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
L_df[j, col_tmp] <- L_por_tmp ## assign into L_df itself, by row index j and column name col_tmp
}
}
Output (only the first few rows are shown):
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773 1596.421 1579.068 1561.716
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546 3192.842 3158.137 3123.432
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320 4789.262 4737.205 4685.148

Create multiple columns from a single column and clean up results

I have a data frame like this:
foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))
I want to add three columns, by splitting the first column with separator _, and retain only the numeric values obtained after the split. For those rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:
> library(tidyr)
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start Zero Start <NA> <NA> NA
2 Zero Start Zero Start <NA> <NA> NA
3 Zero Start Zero Start <NA> <NA> NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 0.270126
7 Zero Stop Zero Stop <NA> <NA> NA
8 Zero Start Zero Start <NA> <NA> NA
I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to like them both, so I can live with that. However, I have two problems:
the new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
when Point.Type can't be split, rpm_nom should be NA, instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns get the Zero Start/Zero Stop. Instead I want all three of them to be NA. How can I do that?
NOTE: I'm using tidyr, but of course you don't need to, if you think there's a better way to do this.
You can post-process the columns with dplyr:
library(dplyr)
foo <- foo %>%
separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
sep="_", remove = FALSE, extra="drop", fill="right") %>%
mutate(across(c(rpm_nom, GVF_nom, p0in_nom), ~ as.numeric(gsub("[^0-9]", "", .x))))
The gsub("[^0-9]", "", .x) part removes all non-numeric characters. If you want to prevent the removal of decimal points, you can use [^0-9.] instead of [^0-9] (as @PierreLafortune did in his answer), but be aware that this will also keep points that are not meant to be decimal points. By wrapping it in as.numeric, you convert the values to numeric while at the same time transforming the empty cells to NA. This gives the following result:
> foo
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start NA NA NA NA
2 Zero Start NA NA NA NA
3 Zero Start NA NA NA NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.9685555
7 Zero Stop NA NA NA NA
8 Zero Start NA NA NA NA
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") :=
lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
function(x) as.numeric(gsub("[^0-9]","",x)))
]
will give a similar result:
> foo
Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1: Zero Start NA NA NA NA
2: Zero Start NA NA NA NA
3: Zero Start NA NA NA NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445 3000 10 13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 1.18581340 3000 10 13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 2.14475950 3000 10 13
7: Zero Stop NA NA NA NA
8: Zero Start NA NA NA NA
The advantage of this is that foo is updated by reference. As this is faster and more memory efficient, it is especially valuable for large datasets.
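With base R, starting from bar (the result of separate() above), we can first set NA where necessary and then coerce the new columns to numeric:
# columns 2:4 are the new ones; Point.Value is left alone so negative values keep their sign
bar[2:4] <- lapply(bar[2:4], function(x) {
is.na(x) <- grepl("Zero", x)
as.numeric(gsub("[^0-9.]", "", x))})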
# Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1 Zero Start NA NA NA NA
# 2 Zero Start NA NA NA NA
# 3 Zero Start NA NA NA NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.2958815
# 7 Zero Stop NA NA NA NA
# 8 Zero Start NA NA NA NA
To reduce to one line (@Jaap):
bar[2:4] <- lapply(bar[2:4], function(x) as.numeric(gsub("[^0-9.]", "", x)))
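For completeness, the whole task can also be done from foo directly with base R only. This is a minimal sketch rather than one of the original answers: it assumes a shortened version of foo with a made-up Point.Value, and it treats any Point.Type with fewer than three "_"-separated pieces as unsplittable.

```r
# Shortened example data (the 0.5 is a made-up value)
foo <- data.frame(
  Point.Type  = c("Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww", "Zero Stop"),
  Point.Value = c(NA, 0.5, NA),
  stringsAsFactors = FALSE
)

parts    <- strsplit(foo$Point.Type, "_", fixed = TRUE)
new_cols <- c("rpm_nom", "GVF_nom", "p0in_nom")

foo[new_cols] <- lapply(seq_along(new_cols), function(k) {
  # k-th piece of each string, or NA when the string cannot be split into 3 pieces
  piece <- vapply(parts,
                  function(p) if (length(p) >= 3) p[k] else NA_character_,
                  character(1))
  # strip units such as "rpm", "%", "barG", then convert to numeric
  as.numeric(gsub("[^0-9.]", "", piece))
})
```

Because the NA is assigned before the digits are extracted, rows such as "Zero Start" never contribute a stray value to rpm_nom, and Point.Value is not touched at all.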

Stop a looping function when one value is greater than another within loop

I have been trying to write a while command to stop the looping function when one value generated by the loop exceeds the other. However, I have failed to figure out the proper way to do it.
The for loop runs for 30 days, but I want it to stop as soon as the last value of parasite_l.A is less than that of parasite_l.B.
I have included the working code I have for generating the data and the for loop.
Alternative solutions without a limit on the loop would also be greatly appreciated.
# Subject A, initially 400 parasites, growing by 10 %
subA = 400
infA = 1.1
# Subject B, initially 120 parasites, growing by 20 %
subB = 120
infB = 1.2
# How many days to model
days = 30
days_seq = seq(1, days, 1)
# Parasite load for A
parasite_l.A = rep(NA, days)
parasite_l.A[1] = subA
# Parasite load for B
parasite_l.B = rep(NA, days)
parasite_l.B[1] = subB
# Loop for subject A and B
for(i in 1:(days)){
parasite_l.A[i+1] = parasite_l.A[i]*(infA)
parasite_l.B[i+1] = parasite_l.B[i]*(infB)
}
parasite_l.A
parasite_l.B
There is a built-in control-flow construct for what you describe, named while. It keeps looping as long as its condition is met.
i <- 1
while (parasite_l.A[i] > parasite_l.B[i]) {
parasite_l.A[i+1] = parasite_l.A[i]*(infA)
parasite_l.B[i+1] = parasite_l.B[i]*(infB)
i <- i + 1
}
# parasite_l.A
# [1] 400.0000 440.0000 484.0000 532.4000 585.6400 644.2040 708.6244
# [8] 779.4868 857.4355 943.1791 1037.4970 1141.2467 1255.3714 1380.9085
# [15] 1518.9993 NA NA NA NA NA NA
# [22] NA NA NA NA NA NA NA
# [29] NA NA
# parasite_l.B
# [1] 120.0000 144.0000 172.8000 207.3600 248.8320 298.5984 358.3181
# [8] 429.9817 515.9780 619.1736 743.0084 891.6100 1069.9321 1283.9185
# [15] 1540.7022 NA NA NA NA NA NA
# [22] NA NA NA NA NA NA NA
# [29] NA NA
Use an index value (i), a couple of counters (A.index.value, B.index.value), and a while loop:
# Subject A, initially 400 parasites, growing by 10 %
subA <- A.index.value <- 400
infA <- 1.1
# Subject B, initially 120 parasites, growing by 20 %
subB <- B.index.value <- 120
infB <- 1.2
# How many days to model
days <- 30
days_seq <- seq(1, days, 1)
# Parasite load for A
parasite_l.A <- rep(NA, days)
parasite_l.A[1] <- subA
# Parasite load for B
parasite_l.B <- rep(NA, days)
parasite_l.B[1] <- subB
# While Loop for subject A and B
i <- 1
while (A.index.value > B.index.value) {
parasite_l.A[i+1] <- A.index.value <- parasite_l.A[i]*(infA)
parasite_l.B[i+1] <- B.index.value <- parasite_l.B[i]*(infB)
i <- i + 1
}
parasite_l.A
parasite_l.B
With the results being:
> parasite_l.A
[1] 400.00 440.00 484.00 532.40 585.64 644.20 708.62 779.49 857.44 943.18 1037.50
[12] 1141.25 1255.37 1380.91 1519.00 NA NA NA NA NA NA NA
[23] NA NA NA NA NA NA NA NA
> parasite_l.B
[1] 120.00 144.00 172.80 207.36 248.83 298.60 358.32 429.98 515.98 619.17 743.01
[12] 891.61 1069.93 1283.92 1540.70 NA NA NA NA NA NA NA
[23] NA NA NA NA NA NA NA NA
Alternatively, keep the for loop and use break inside it to stop as soon as the newest value of A falls below the newest value of B:
for (i in seq_len(days - 1)) {
  parasite_l.A[i+1] = parasite_l.A[i]*(infA)
  parasite_l.B[i+1] = parasite_l.B[i]*(infB)
  # compare single elements, not the whole vectors
  if (parasite_l.A[i+1] < parasite_l.B[i+1]) {
    break
  }
}
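Because both loads grow geometrically, the crossing day can also be computed directly, which avoids any limit on the loop. This is a sketch of a closed-form alternative that was not among the original answers; it assumes subA > subB and infB > infA, as in the question. Solving subA * infA^t < subB * infB^t for t gives t > log(subA / subB) / log(infB / infA).

```r
subA <- 400; infA <- 1.1   # subject A: start value and daily growth factor
subB <- 120; infB <- 1.2   # subject B

# Smallest whole number of growth steps after which B is at least A.
# (If the ratio were an exact integer, the two loads would be equal at that step.)
t_cross <- ceiling(log(subA / subB) / log(infB / infA))

# Loads from the start (step 0) up to and including the crossing step
steps <- 0:t_cross
parasite_l.A <- subA * infA^steps
parasite_l.B <- subB * infB^steps
```

With the values from the question t_cross is 14, so both vectors have 15 elements, the same stopping point as the while loops above.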

subsetting error in R

I have a large dataframe called dualbeta which contains 2 rows and 6080 columns. Here is a sample:
row.names A.Close AA.Close AADR.Close AAIT.Close AAL.Close
1 upside 1.253929 0.9869027 0.6169613 0.6353903 0.1782124
2 downside 1.027412 1.1936236 0.5915299 0.5697878 0.1702382
I am trying to extract only those columns with an upside >= 1.00 and a downside <= 1.00. I used combinations <- subset(dualbeta, upside>=1.00 & downside<=1.00) but I get the following:
row.names A.Close AA.Close AADR.Close AAIT.Close
1 NA NA NA NA NA
2 NA.1 NA NA NA NA
3 NA.2 NA NA NA NA
4 NA.3 NA NA NA NA
5 NA.4 NA NA NA NA
...
It should just return a 2 by x table, where x is the number of combinations found. I do not know why I am getting a bunch of rows. Additionally, I thought I had NA values in dualbeta, so I used na.omit(dualbeta)->dualbeta, but it deleted everything and turned dualbeta into a 0 by 6080 data frame. I also used which(is.na(dualbeta)), which returned 3307 and 3308, but when I checked those columns, they did not contain NAs.
You might work on the transpose of the data in order to select rows with the proper characteristics (which are columns in the transpose):
# Fix up the data, use proper row names
rownames(x) <- x$row.names
# Remove old row name column
x <- x[-1]
# transpose and subset
subset(data.frame(t(x)), upside > 1 & downside < 1)
This expression returns a zero-length result with your example data. Changing the parameters shows what is returned:
subset(data.frame(t(x)), upside > .6 & downside < .6)
## upside downside
## AADR.Close 0.6169613 0.5915299
## AAIT.Close 0.6353903 0.5697878
You can extract the data with simple indexing.
Let's say this is your data
dualbeta<-data.frame(matrix(runif(24,0,2),
nrow=2,
dimnames=list(c("upside","downside"), letters[1:12])))
then you can extract with
dualbeta[, dualbeta[1,]>=1.00 & dualbeta[2,]<=1.00]
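A deterministic variant of the same idea is sketched below. It assumes the rows are named "upside" and "downside" (as in the random example above) and uses three columns from the question plus one hypothetical column, X.Close, that satisfies both conditions. unlist() turns each one-row selection into a plain named vector, and drop = FALSE keeps the result a data frame even when only one column qualifies.

```r
dualbeta <- data.frame(
  A.Close    = c(1.253929, 1.027412),
  AA.Close   = c(0.9869027, 1.1936236),
  AADR.Close = c(0.6169613, 0.5915299),
  X.Close    = c(1.10, 0.90),            # hypothetical column, added for illustration
  row.names  = c("upside", "downside")
)

# One logical value per column
keep <- unlist(dualbeta["upside", ]) >= 1 & unlist(dualbeta["downside", ]) <= 1

# Select the qualifying columns; drop = FALSE avoids collapsing to a vector
combinations <- dualbeta[, keep, drop = FALSE]
```

The result is a 2 by x data frame, here with x = 1, since only X.Close has upside >= 1 and downside <= 1.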
