R lag over missing data

Is there a variant of lag somewhere that keeps NAs in position? I want to compute returns of price data where data could be missing.
Col 1 is the price data
Col 2 is the lag of price
Col 3 shows p - lag(p) - the return from 99 to 104 is effectively missed, so the path length of the computed returns will differ from the true one.
Col 4 shows the lag with NA position preserved
Col 5 shows the new difference - now the return of 5 for 2009-11-07 is available
Cheers, Dave
library(xts)
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
# fake the lag I want, with the NA kept in position
x.pos.lag <- lag.xts(x)
x.pos.lag['2009-11-07'] <- 99
x.pos.lag['2009-11-06'] <- NA
cbind(x, lag.xts(x), x - lag.xts(x), x.pos.lag, x - x.pos.lag)
..1 ..2 ..3 ..4 ..5
2009-11-01 100 NA NA NA NA
2009-11-02 101 100 1 100 1
2009-11-03 97 101 -4 101 -4
2009-11-04 95 97 -2 97 -2
2009-11-05 99 95 4 95 4
2009-11-06 NA 99 NA NA NA
2009-11-07 104 NA NA 99 5
2009-11-08 103 104 -1 104 -1
2009-11-09 103 103 0 103 0
2009-11-10 100 103 -3 103 -3

There are no functions to do that natively in R, but you can create an index of the original NA positions and then swap the values there after the lag.
x <- xts(c(100, 101, 97, 95, 99, NA, 104, 103, 103, 100), as.Date("2009-11-01") + 0:9)
lag.xts.na <- function(x, ...) {
  # remember where the NAs were in the original series
  na.idx <- which(is.na(x))
  x2 <- lag.xts(x, ...)
  # move the value that landed on the old NA position one step forward,
  # then restore the NA in its original position
  # (note: this index shift assumes the default lag of 1)
  x2[na.idx + 1, ] <- x2[na.idx, ]
  x2[na.idx, ] <- NA
  return(x2)
}
lag.xts.na(x)
[,1]
2009-11-01 NA
2009-11-02 100
2009-11-03 101
2009-11-04 97
2009-11-05 95
2009-11-06 NA
2009-11-07 99
2009-11-08 104
2009-11-09 103
2009-11-10 103
Incidentally, are you just trying to deal with weekends/holidays or something along those lines? If so, you might consider dropping those positions from your series; that will dramatically simplify things for you. Alternatively, the timeSeries package in Rmetrics has a number of functions to deal with business days.
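For reference, a minimal sketch of that "drop the missing positions" alternative, reusing the x series from the question (na.omit simply removes the NA observations from the xts object):
x.clean <- na.omit(x)
cbind(x.clean, lag.xts(x.clean), x.clean - lag.xts(x.clean))
# each lag now refers to the previous available observation, so the
# 2009-11-07 return of 5 (104 - 99) is preserved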

Related

How can I differentiate between months using pairs() in R?

data(airquality)
a <- airquality
convert_fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
a[, 4] <- convert_fahr_to_kelvin(a[, 4])
oz <- a[, 1]
sr <- a[, 2]
wv <- a[, 3]
te <- a[, 4]
pairs(~oz + sr + wv + te,
      col = c("orange"),
      pch = c(18),
      labels = c("Ozono", "Irradiancia Solar", "Velocidad del viento", "Temperatura"),
      main = "Diagramas de dispersión por parejas")
This is the plot I get.
What I am doing works, but I would actually like to differentiate between months: for example, the first 31 rows of my a matrix (across all columns) should be one colour, say green, and so on for each month. I tried to separate the rows into groups using group:
group <- NA
group[sr[1:31]]<-1
group[sr[32:61]]<-2
group[sr[62:92]]<-3
group[sr[93:123]]<-4
group[sr[124:153]]<-5
group[sr[1:31]]
group[sr[32:61]]
group[sr[62:92]]
group[sr[93:123]]
group[sr[124:153]]
Here the numbers are repeated.
But what I get is that rows whose values happen to be the same end up in the same group, and I have been trying to solve it in other ways, but I never quite get what I want.
It is easier to create the group index with gl()
group <- as.integer(gl(length(sr), 31, length(sr)))
table(group)
#group
#1 2 3 4 5
#31 31 31 31 29
In the OP's code, 'group' is initialized as an NA of length 1. Then it is assigned based on the values of 'sr', instead of simply
group <- integer(length(sr))
group[1:31] <- 1
group[32:61] <- 2
...
whereas if we use sr values as index
sr[1:31]
#[1] 190 118 149 313 NA NA 299 99 19 194 NA 256 290 274 65 334 307 78 322 44 8 320 25 92 66 266 NA 13 252 223 279
then group values that are changed to 1 are at positions 190, 118, 149, 313, ....
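As a follow-up sketch (not part of the original answer, and assuming the oz, sr, wv, te and group objects created above), the group index can then be passed to pairs() through col to colour the points by month:
cols <- c("green", "orange", "blue", "red", "purple") # one arbitrary colour per month
pairs(~oz + sr + wv + te,
      col = cols[group],
      pch = 18,
      labels = c("Ozono", "Irradiancia Solar", "Velocidad del viento", "Temperatura"),
      main = "Diagramas de dispersión por parejas")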

Calculate second highest cumulative value by group

I have data with a grouping variable 'grps' and a value 'x'. I have calculated the cummax within each group 'cmx'. Now I need to find the second highest cumulative value of 'x' within each group, scmx.
Some data, including the desired column scmx:
library(data.table)
d = structure(list(date = structure(rep(c(18690, 18691, 18692, 18693, 18694, 18695, 18696, 18697), 2), class = "Date"),
x = c(18, 70, 57, 94, 94, 13, 98, 23, 20, 72, 59, 96, 96, 15, 100, 25),
grps = c(rep("g1", 8), rep("g2", 8))),
row.names = c(NA, -16L), class = c("data.table", "data.frame"))
d[, cmx := cummax(x), by = .(grps)]
d[, scmx := c(18, 18, 57, 70, 70, 70, 94, 94, 20, 20, 59, 72, 72, 72, 96, 96)]
Context
If x corresponds to a performance rating, what I am trying to do is locate the date when they achieved their best performance and their second best. A similar question of mine where I needed to locate the row which corresponded to the highest cumulative value in a column:
Fill down first row within each cumulative max, with a twist
A data.table alternative:
d[ , scmx2 := {
c(x[1], sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# date x grps cmx scmx scmx2
# 1: 2021-03-04 18 g1 18 18 18
# 2: 2021-03-05 70 g1 70 18 18
# 3: 2021-03-06 57 g1 70 57 57
# 4: 2021-03-07 94 g1 94 70 70
# 5: 2021-03-08 94 g1 94 70 70
# 6: 2021-03-09 13 g1 94 70 70
# 7: 2021-03-10 98 g1 98 94 94
# 8: 2021-03-11 23 g1 98 94 94
# 9: 2021-03-04 20 g2 20 20 20
# 10: 2021-03-05 72 g2 72 20 20
# 11: 2021-03-06 59 g2 72 59 59
# 12: 2021-03-07 96 g2 96 72 72
# 13: 2021-03-08 96 g2 96 72 72
# 14: 2021-03-09 15 g2 96 72 72
# 15: 2021-03-10 100 g2 100 96 96
# 16: 2021-03-11 25 g2 100 96 96
Within each group (by = grps), loop (sapply) over a sequence from 2 to the number of rows in the current group (seq(.N)[-1]). In each step, subset 'x' from the start of the vector to the index 'i' (v = x[1:i]).
Calculate the dense rank and check if the rank is 2 (frank(-v, ties.method = "dense") == 2), i.e. the rank of the second largest distinct number. Use the logical index to subset 'v' (v[...]). Select the first match ([1]; in case several values share rank 2). Concatenate the result from this 'expanding window' with the first element of 'x' (c(x[1], ...)). For example, in the third row of group g1 the window is v = c(18, 70, 57); frank(-v, ties.method = "dense") gives c(3, 1, 2), so the value with rank 2 is 57.
In the first window, with only one value, there is clearly no second highest value. Here the OP has chosen to return the first value. The same choice needs to be made for longer windows in which all values are equal, which occurs when there are leading runs of equal values. If we would rather return NA than the first value, then replace the x[1] in the line
c(x[1], sapply(seq(.N)[-1], function(i){
...with NA_real_.
Small demo:
d = data.table(grps = c(1, 1, 2, 2, 2), x = c(3, 3, 4, 4, 5))
d[ , scmx2 := {
c(NA_real_, sapply(seq(.N)[-1], function(i){
v = x[1:i]
v[frank(-v, ties.method = "dense") == 2][1]
}))
}, by = grps]
# grps x scmx2
# 1: 1 3 NA # grp 1: all values equal in all windows -> all NA
# 2: 1 3 NA
# 3: 2 4 NA
# 4: 2 4 NA
# 5: 2 5 4 # grp 2: only the last window has a second highest value
This question is indeed similar to the post I linked to above (Finding cumulative second max per group in R). However, here the OP asked for a data.table solution.
Here is another option using a non-equi join: for each row, look up the rows in the same group with a date on or before the current date and an x strictly below the current cumulative max, and take the largest such x; rows where no smaller value has been seen yet are then filled with x itself.
d[, s2 := .SD[.SD, on=.(grps, date<=date, x<cmx), by=.EACHI, max(x.x)]$V1]
d[is.na(s2), s2 := x][]
output:
date x grps cmx scmx s2
1: 2021-03-04 18 g1 18 18 18
2: 2021-03-05 70 g1 70 18 18
3: 2021-03-06 57 g1 70 57 57
4: 2021-03-07 94 g1 94 70 70
5: 2021-03-08 94 g1 94 70 70
6: 2021-03-09 13 g1 94 70 70
7: 2021-03-10 98 g1 98 94 94
8: 2021-03-11 23 g1 98 94 94
9: 2021-03-04 20 g2 20 20 20
10: 2021-03-05 72 g2 72 20 20
11: 2021-03-06 59 g2 72 59 59
12: 2021-03-07 96 g2 96 72 72
13: 2021-03-08 96 g2 96 72 72
14: 2021-03-09 15 g2 96 72 72
15: 2021-03-10 100 g2 100 96 96
16: 2021-03-11 25 g2 100 96 96
Create a sequence that is the length of the column x. For each element of that sequence, apply the function to the sub-vector of x from index 1 up to that element, keeping only the unique values. Rfast::nth can be used to take the 2nd highest number in a vector.
library(Rfast)
sapply(seq(length(d$x)), function(x) {
return(nth(unique(d$x[1:x]), 2, descending=TRUE))
})
[1] 2.652495e-315 1.800000e+01 5.700000e+01 7.000000e+01
[5] 7.000000e+01 7.000000e+01 9.400000e+01 9.400000e+01
To do it for the new data frame, we can still use the function created above. Arrange the data frame so that each group's values are in their own column, then use lapply with rollapplyr to capture the 2nd largest unique value.
library(dplyr)
library(tidyr)
library(zoo)
d1 <- d %>% select(-cmx) %>%
  pivot_wider(names_from = grps, values_from = x)
lapply(d1[-1], function(x) {
  my_list <- rollapplyr(x, seq(length(x)), function(x) { return(nth(unique(x), 2, descending = TRUE)) })
  return(my_list)
})
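For completeness, a hedged alternative sketch that combines the ideas above (the expanding window and rollapplyr) without reshaping, applied to the d defined earlier; the first row of each group, and any window with only one distinct value, comes out as NA, matching the NA_real_ variant discussed above:
library(data.table)
library(zoo)
d[, scmx3 := rollapplyr(x, seq_len(.N),
                        function(v) sort(unique(v), decreasing = TRUE)[2]),
  by = grps]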

In a series of series, how to subtract every 1st number in each sub-series event from every nth number in those events?

I have multiple series of timepoints. Some series have five timepoints, others have ten or fifteen timepoints. The series are in multiples of five because the event I am measuring is always five timepoints long; some recordings have multiple events in succession. For instance:
Series 1:
0
77
98
125
174
Series 2:
0
69
95
117
179
201
222
246
277
293
0 marks the beginning of each series. Series 1 is a single event, but Series 2 is two events in succession. The 6th timepoint in Series 2 is the start of the second event in that series.
I have an R dataframe that contains every timepoint in one column:
dd <- data.frame(
timepoint=c(0, 77, 98, 125, 174,
0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)
I need to know the duration from the start of each event to the 4th timepoint in each event. For the above data, that means:
Duration 1: 125 - 0 = 125
Duration 2: 117 - 0 = 117
Duration 3: 277 - 201 = 76
How can I write a simple piece of R code that will tell me the duration of that interval regardless of how many series or events there are, i.e. regardless of how many numbers are in the column?
I tried using diff() and seq_along(), but that seems only useful for every nth number, which doesn't work in this case.
diff(vec[seq_along(vec) %% 4 == 1])
This is maybe one way to do it with dplyr. We break up the data into "runs", which reset at each 0, and then we have the "sequences", which reset every 5 values.
library(dplyr)
dd %>%
  group_by(run = cumsum(timepoint == 0)) %>%
  mutate(seq = (row_number() - 1) %/% 5 + 1) %>%
  group_by(run, seq) %>%
  summarize(diff = timepoint[4] - timepoint[1])
# run seq diff
# <int> <dbl> <dbl>
# 1 1 1 125
# 2 2 1 117
# 3 2 2 76
It makes it somewhat easy to tie the value back to where it came from.
If you just wanted to use indexing, here's a helper function
diff4v1 <- function(x) {
  idx <- (seq_along(x) - 1) %% 5 + 1
  x[idx == 4] - x[idx == 1]
}
diff4v1(dd$timepoint)
# [1] 125 117 76
This is your data frame (hypothetical)
df = data.frame(series = round(rnorm(40, 100, 50)))
head(df)
series
1 16
2 35
3 75
4 125
5 190
6 85
And these are your differences
idx = c(1:nrow(df))
df[which(idx %% 5 == 4), "series"] - df[which(idx %% 5 == 1), "series"]
[1] 109 -38 -101 -47 34 -52 -63 -5
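For reference, the same modular indexing applied to the dd from the question reproduces the three durations:
idx <- seq_along(dd$timepoint)
dd$timepoint[idx %% 5 == 4] - dd$timepoint[idx %% 5 == 1]
# [1] 125 117 76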

If() statement in R

I am not very experienced with if statements and loops in R.
Perhaps you can help me solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, up to the point where sum(df$fz) is no longer below 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating, the new column df$new_value should look like 134 68 71 88 89.
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but that is not what I want, because in that case I would have to pick the top manually after checking sum(df$fz).
From what I understood from #Oksana's question and comments, we probably can do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

Divide a vector by different values based on the result of the division

I have a Df like this:
x y z
<dbl> <dbl> <dbl>
1 408001.9 343 0
2 407919.2 343 0
3 407839.6 343 0
4 407761.2 343 0
5 407681.7 343 0
6 407599.0 343 0
7 407511.0 343 0
8 407420.5 343 0
9 407331.0 343 0
10 407242.0 343 0
11 407152.7 343 0
12 407062.5 343 0
13 406970.7 343 0
14 406876.6 342 0
15 406777.1 342 0
16 406671.0 342 0
17 406560.9 342 0
18 406449.4 342 0
19 406339.0 342 0
20 406232.5 342 0
... ... ... ...
with x decreasing.
And a vector like
vec=(a1, a2, a3, a4, a5, a6, ...)
with a1< a2< a3< a4...
Now I want to divide df$x by vec[1], which will give the same result (rounded) as df$y.
But when the value in df$z drops by one to 342, I want to divide the values in df$x by vec[2] from then on, to get the new df$z values.
From here on the result will differ from df$y, since for df$y the divisor is always vec[1] and never changes.
Every time the value I get for df$z drops by one, the next values of df$z shall be calculated with the corresponding vec[i], where i is the number of drops + 1 so far.
In the end I want a column df$z whose values are df$x / vec[i], where vec[i] depends on what the last value of df$z was.
reproducible example:
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)
for (i in 1:31) {
  test[i + 1] <- round(test$x / vec[i])
}
This gives a data frame with one column for every value of vec that test$x was divided by.
Now, in the end, my vector shall contain the values of col2 until the value in col2 drops from 60 to 59. After that I want the values from col3, until the value in col3 drops from 59 to 58. Then I want the values from col4, and so on.
How can I achieve this for arbitrary data (like mine above, which is not linearly distributed like this example)?
I tried some for and while loops, but none worked; I didn't even get close to what I want.
I think my problem is that I don't know how to make the condition dependent on a value (the value of df$z at row i) that I want to calculate in the same operation. I want to calculate df$z[i] with the value of vec[t] that has been used so far, but if df$z drops by one at some observation i, then vec[t+1] shall be used for the division from then on.
Thanks for your help.
I hope I've understood what you are asking. This might be it...
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)
# this function determines the index of `vec` to use for each element of v
xcol <- function(v) {
  x <- rep(NA, length(v))
  x[1] <- 1
  for (i in 2:length(v)) {
    x[i] <- x[i - 1]
    # if the rounded quotient drops relative to the previous row,
    # switch to the next divisor from this point on
    if (round(v[i] / vec[x[i]]) < round(v[i - 1] / vec[x[i]])) {
      x[i] <- x[i] + 1
    }
  }
  return(x)
}
test$xcol <- xcol(test$x)
test$z <- round(test$x/vec[test$xcol])
test
x xcol z
1 600 1 60
2 598 1 60
3 596 1 60
4 594 2 59
5 592 2 59
6 590 2 59
7 588 2 59
8 586 3 58
9 584 3 58
10 582 3 58
11 580 3 58
12 578 4 57
...
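As a small design note (a sketch, not part of the original answer): xcol reads vec from the global environment; passing the divisor vector in as an argument makes the helper self-contained and reusable on other data, such as the df and vec from the question.
xcol2 <- function(v, divisors) {
  idx <- rep(NA_integer_, length(v))
  idx[1] <- 1
  for (i in 2:length(v)) {
    idx[i] <- idx[i - 1]
    # when the rounded quotient drops, switch to the next divisor from here on
    if (round(v[i] / divisors[idx[i]]) < round(v[i - 1] / divisors[idx[i]])) {
      idx[i] <- idx[i] + 1
    }
  }
  idx
}
test$z2 <- round(test$x / vec[xcol2(test$x, vec)])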
