R - End a loop on a fixed number of rows

I have a loop that begins when a certain condition occurs: it enters when the value is below -1.0 and exits when the value is above 1.0.
# Loop for long entry and exit signal
entry <- -1.0  # Input value for entry
exit <- 1.0    # Input value for exit
mydf$sig_long[[1]] <- ifelse(mydf$roll.z.score.n3[[1]] < entry, 1, 0)
for (i in 2:nrow(mydf)) {
  mydf$sig_long[i] <- ifelse(mydf$roll.z.score.n3[i] < entry, 1,
                        ifelse(mydf$roll.z.score.n3[i] > exit, 0,
                               mydf$sig_long[i - 1]))
}
I wanted to learn how to use the same loop, but instead of exiting on values > 1.0, exit after a set number of lines. If, for example, I set an nlines variable to 5, it would print 1 for a total of 5 lines after the initial entry. An example data frame is below:
roll.z.score.n3
1 0
2 0
3 0.651651537
4 -1.153593891
5 -0.926552368
6 -0.369016769
7 0.65405595
8 -1.139305279
9 0.358231351
10 1.135314685
11 0.944997472
12 -0.293105191
13 -1.146659778
14 -0.66246734
15 -1.131901741
16 -0.600480649
17 -1.152333435
18 1.1025176
19 -0.144684006
20 -0.678000883
21 -1.146875039
22 -1.132235788
23 0.115583229
24 0.645489447
25 1.148754398
26 0.988193418
27 -0.818892395
After the script has run I would lag the sig_long column by one line:
new.df$sig_long <- Lag(new.df$sig_long,1)
Code Testing
OK, so I see the issue. Because the first part of the loop enters and exits between -1.0 and 1.0, it prints 1's between those two criteria in the new.df$sig_long column. The problem arises if I set nlines to 5: the entry/exit logic might exit the trade after only 3 lines in the sig_long column. In that case rollsum has only three 1's to count in that column, so even though I want to hold for 5 lines, it cannot compute the 5-line hold time. Perhaps we can use the first part of the loop like this, in order to print a 1 at the entry condition:
new.df$sig_long <- ifelse(new.df$roll.z.score.n3 < -1.0, 1, 0)  # Set 1 at entry criteria
That sets a 1 at each value < -1.0. Next would be how to count 5 lines forward from that point: count 1, 2, 3, 4, 5, then 0 until the next 1, then count 1, 2, 3, 4, 5 again, then 0 until the next 1, and so on.
I can see the logic: if new.df$sig_long == 1, count until nlines (5), else 0, and repeat...
Then the second part of the loop would, I think, work for the new_sig_long column.
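A rough sketch of that counting idea on a toy 0/1 vector (sig and hold are made-up names, just to show the 1, 2, ..., nlines-then-0 pattern; this is not the final answer below):
nlines <- 5
sig <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0)  # hypothetical entry markers: 1 = entry line
hold <- integer(length(sig))
for (i in seq_along(sig)) {
  if (sig[i] == 1) {
    hold[i] <- 1                              # fresh entry: start (or restart) the count
  } else if (i > 1 && hold[i - 1] >= 1 && hold[i - 1] < nlines) {
    hold[i] <- hold[i - 1] + 1                # keep counting 2, 3, ..., nlines
  }                                           # otherwise it stays 0
}
as.integer(hold > 0)
# 0 0 1 1 1 1 1 0 0 1 1 1  -- the entry line and the next nlines - 1 lines are marked 1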

Made many changes. This should now work:
zz = '
roll.z.score.n3 sig_long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
'
library(quantmod)  # for Lag()
library(zoo)       # for rollsumr()
df <- read.table(text = zz, header = TRUE)
df <- na.omit(df)
df$sig_long[[1]] <- ifelse(df$roll.z.score.n3[[1]] < -1, 1, 0)
for (i in 2:nrow(df)) {
  df$sig_long[i] <- ifelse(df$roll.z.score.n3[i] < -1, 1,
                      ifelse(df$roll.z.score.n3[i] > 1, 0,
                             df$sig_long[i - 1]))
}
df$sig_long <- Lag(df$sig_long, 1)
colnames(df)[2] <- "sig_long"
df <- na.omit(df)
nlines <- 5
df$rollsum <- rollsumr(df$sig_long, k = nlines, fill = 0)
colnames(df)[3] <- "rollsum"
df$new_sig_long[[1]] <- ifelse(df$roll.z.score.n3[[1]] < -1, 1, 0)
for (i in 2:nrow(df)) {
  df$new_sig_long[i] <- ifelse(df$roll.z.score.n3[i] < -1 & df$rollsum[i] < 5, 1,
                          ifelse(df$roll.z.score.n3[i] > 1 | df$rollsum[i] >= 5, 0,
                                 df$sig_long[i - 1]))
}
df
Output:
> df
roll.z.score.n3 sig_long rollsum new_sig_long
# 4 0.1933112 1 0 0
# 5 0.7142857 1 0 1
# 6 -1.1487535 1 0 1
# 7 -0.9421604 1 0 1
# 8 -0.6957637 1 5 0
# 9 1.1406469 1 5 0
# 10 0.9851969 0 4 1
# 11 -0.7687666 0 3 0
# 12 -1.0112939 0 2 1
# 13 -0.5167036 1 2 0
# 14 -1.1208971 1 2 1
# 15 1.0910895 1 3 0
# 16 0.9683641 0 3 1
# 17 0.8728716 0 3 0
# 18 1.0995250 0 2 0
# 19 0.9183979 0 1 0
Then you can take the lag of the new_sig_long if you wish.
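For example, using the same Lag() call as earlier:
df$new_sig_long <- Lag(df$new_sig_long, 1)  # shift the new signal to the next row, as was done for sig_long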

Related

Create time-to-event and time-after-event variables

I am working on panel data that looks like this:
d <- data.frame(id = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c"),
time = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
iz = c(0,1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1))
id time iz
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
5 a 5 0
6 b 1 0
7 b 2 0
8 b 3 0
9 b 4 0
10 b 5 1
11 c 1 0
12 c 2 0
13 c 3 0
14 c 4 1
15 c 5 1
Here iz is an indicator for an event or a treatment (iz = 1). What I need is a variable that counts the periods before and after an event or the distance to and from an event. This variable would look like this:
id time iz nvar
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -1
12 c 2 0 -2
13 c 3 0 -3
14 c 4 1 0
15 c 5 1 0
I have tried working with the answers given here and here but can't make it work in my case.
I would really appreciate any ideas how to approach this problem. Thank you in advance for all ideas and suggestions.
1) rleid This code applies rleid from data.table to each id and then generates a negative reverse sequence if that produces a run of 1's, and a forward sequence otherwise; i.e. we assume that a forward positive sequence should be used except before the first run of ones. For the 1's in iz, zero that out. There can be any number of runs in an id and it also supports id's with only 0's or only 1's. It assumes that time has no gaps.
library(data.table)
Seq <- function(x, s = seq_along(x)) if (x[1] == 1) -rev(s) else s
nvar <- function(iz, r = rleid(iz)) ave((1-iz) * r, r, FUN = Seq)
transform(d, nvar = (1-iz) * ave(iz, id, FUN = nvar))
giving:
id time iz nvar
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -3
12 c 2 0 -2
13 c 3 0 -1
14 c 4 1 0
15 c 5 1 0
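As a quick aside on what rleid() contributes here (a tiny illustration, not part of the answer's code):
library(data.table)
rleid(c(0, 1, 1, 0, 0))
# 1 2 2 3 3   -- one id per run, so each run of 0's or 1's becomes its own group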
2) base This code uses only base R. It assumes that every id has at most one run of ones. There is no restriction on whether there are any zeros. Also it supports gaps in time. It applies nvar to the row numbers of each id. First it calculates the range rng of the times of the ones and then calculates the signed distance in the last line of nvar. The output is identical to that shown in (1). If we could assume that every id has exactly one run of 1's the if statement could be omitted.
nvar <- function(ix) with(d[ix, ], {
  if (all(iz == 0)) return(iz)
  rng <- range(time[iz == 1])
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
})
transform(d, nvar = ave(1:nrow(d), id, FUN = nvar))
2a) This variation of (2) passes time and iz to nvar by encoding them as the real and imaginary parts of a complex vector in order to avoid having to deal with row numbers but it is otherwise the same as (2). We have omitted the if statement in (2) but it could be added back in if any id's have no ones.
nvar <- function(x, time = Re(x), iz = Im(x), rng = range(time[iz == 1]))
(time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
transform(d, nvar = Re(ave(time + iz * 1i, id, FUN = nvar)))
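The complex-number device in (2a) is just a way to pass two numeric vectors through ave() as one; a tiny illustration:
z <- c(1, 2, 3) + c(0, 1, 0) * 1i  # time in the real part, iz in the imaginary part
Re(z)  # 1 2 3  (recovers time)
Im(z)  # 0 1 0  (recovers iz)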
Here is a solution that is a (tiny) bit more complex than the one from G.Grothendieck, but it will be able to handle non-sequential times.
library( data.table )
#make d a data.table
setDT(d)
#you can remove the trailing [], they are just for passing the output to the console...
#nvar = 0 where iz = 1
d[ iz == 1, nvar := 0 ][]
#calculate nvar for iz == 0 BEFORE iz == 1, using a forward rolling join
#create subsets for readability
d1 <- d[ iz == 1, ]
d0 <- d[ iz == 0, ]
d[ iz == 0, nvar := time - d1[ d0, x.time, on = .(id, time), roll = -Inf ] ][]
#calculate nvar for iz == 0 AFTER iz == 1, using a backward rolling join
#create subsets for readability
d1 <- d[ iz == 1, ]
d0 <- d[ iz == 0 & is.na( nvar ), ]
d[ iz == 0 & is.na(nvar) , nvar := time - d1[ d0, x.time, on = .(id, time), roll = Inf ] ][]
# id time iz nvar
# 1: a 1 0 -1
# 2: a 2 1 0
# 3: a 3 1 0
# 4: a 4 0 1
# 5: a 5 0 2
# 6: b 1 0 -4
# 7: b 2 0 -3
# 8: b 3 0 -2
# 9: b 4 0 -1
# 10: b 5 1 0
# 11: c 1 0 -3
# 12: c 2 0 -2
# 13: c 3 0 -1
# 14: c 4 1 0
# 15: c 5 1 0
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
d %>%
  group_by(id) %>%
  mutate(nvar = map_dbl(.x = seq_along(iz), ~ min(abs(.x - which(iz == 1)))),
         nvar = if_else(cumsum(iz) == 0, -nvar, nvar))
id time iz nvar
<fct> <dbl> <dbl> <dbl>
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -3
12 c 2 0 -2
13 c 3 0 -1
14 c 4 1 0
15 c 5 1 0

R: modify a variable conditioned on data from multiple previous rows

Hi, I would really appreciate some help with this; I couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition is that, for each id, "eleg" = 1 if "varx" == 0 in the previous 3 rows and varx == 1 in the current row, and 0 otherwise. If possible, using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One approach could be
library(dplyr)
m <- 3  # number of previous rows to look back
df %>%
  group_by(id) %>%
  mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = id, default = 1) == 0)) == m & varx == 1,
                       1,
                       0)) %>%
  data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
library(data.table)
library(dplyr)
df %>%
  mutate(elegnew = ifelse(Reduce("+", shift(df$varx, 1:3)) == 0 & df$varx == 1, 1, 0))
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
  group_by(id) %>%
  mutate(elegnew = as.integer(varx == 1 &
                                rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
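For the real data where the condition is the previous 24 rows, the same idea should carry over with k = 24 + 1 = 25 (the 24 previous rows plus the current one). A sketch of that variant (eleg24 is just a made-up name, and each id needs at least 25 rows for the rolling window):
df %>%
  group_by(id) %>%
  mutate(eleg24 = as.integer(varx == 1 &
                               rollsum(varx == 1, k = 25, align = "right", fill = 0) == 1))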
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3) {
  DF[[new]] <- 0
  for (i in seq_len(nrow(DF))[-seq_len(n)]) {
    if (all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
      DF[[new]][i] <- 1
  }
  DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.

Find the number of previous elements in vector until current value was equal or exceeded

For every element in a vector, I want to calculate the number of previous elements until I reach an element that is equal to or exceeds the value of the current element.
This is my vector:
CPC= c(25382.6, 30044.9, 22664.4, 30752.3, 21896.9, 24173.1, 29777.9, 9021.1, 8175.1, 9005.8, 5314.2, 4274.1, 3759.1, 5043.1, 5080.9, 6444.4, 6097.6, 8004.2, 6004.7, 6468.9, 5104.7, 5985.5, 8343.7, 8582, 7494.3, 6088.9, 4372.7, 4298.6, 4553.2, 5742)
I have tried something like this, which is not working
ROC = NULL #Create vector to store values
for (i in seq(1:length(CPC))
while(CPC[i])<CPC[i-1]
ROC[i] <- ifelse((CPC[i] < CPC[i-1]),1,0)
The output I am looking for is a vector (ROC) of the same length as the original one (CPC), so that every element corresponds to the element in the same position in the original vector. For CPC[i], ROC[i] would give the number of previous elements in the CPC vector until a value that is equal to or bigger than CPC[i] is reached. For some elements there is no previous element with a higher or equal value, and this would still need to be recorded in the output, for example as NA.
It would look like this:
ROC
NA,NA,1,NA,1,2,3,1,1,2... etc
prs <- unlist(lapply(1:length(CPC),
  function(x) {
    less_or_eq <- CPC[1:x] <= CPC[x]
    if (all(less_or_eq))
      return(0)
    inds <- which(less_or_eq == 0)
    return(x - inds[length(inds)])
  }))
prs
prs
# 0 0 1 0 1 2 3 1 1 2 1 1 1 3 4 6 1 8 1 2 1 2 13
# 14 1 1 1 1 3 4
Using a for loop:
res <- numeric(length(CPC))
for (i in 1:length(CPC)) {
  less_or_eq <- CPC[1:i] <= CPC[i]
  if (all(less_or_eq)) {
    res[i] <- 0
  } else {
    inds <- which(less_or_eq == 0)
    res[i] <- i - inds[length(inds)]
  }
}
res
#[1] 0 0 1 0 1 2 3 1 1 2 1 1 1 3 4 6 1 8 1 2 1 2 13
#[24] 14 1 1 1 1 3 4
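If, as in the question, you would rather have NA than 0 when no earlier element reaches the current value, one small tweak of the result above (my addition, not part of the original answer) is:
res[res == 0] <- NA  # res is 0 only when no earlier element exceeds CPC[i]
res
# NA NA 1 NA 1 2 3 1 1 2 ...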
Your output requirement is not very clear. The main part is to find where the difference (diff) changes from negative to positive (meaning that the previous value is less than or equal to the current one). We create the groups based on that and count the length of each group, i.e.
tapply(CPC, cumsum(c(TRUE, diff(CPC) >= 0)), length)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 2 2 1 3 4 1 1 2 2 2 1 1 5 1 1
#or take the cumulative sum of the above,
cumsum(tapply(CPC, cumsum(c(TRUE, diff(CPC) >= 0)), length))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 3 5 6 9 13 14 15 17 19 21 22 23 28 29 30

R - Finance back test function for entry / exit signal

OK, so I am back-testing trading signals using R. Here is a snippet of my code which shows the z-score creation, close-to-close returns (using the TTR package), the long signal, and the Lag() to place the signal on the next day.
require(quantmod)
require(TTR)
require(zoo)
# Calculate n period close price z-scores indicator using TTR package
new.df$roll.mean.n3 <- runMean(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.sd.n3 <- runSD(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.z.score.n3 <- apply(new.df[,c('Close','roll.mean.n3', 'roll.sd.n3')], 1, function(x) { (x[1]-x[2])/x[3] } )
# Calculate Close-to-Close returns
new.df$clret <- ROC(new.df$Close,1)
new.df$clret[1] <- 0
# Create the long (up) signal
new.df$sigup <- ifelse(new.df$roll.z.score.n3 < -1, 1, 0)
# Lag signals to align with days in market not days when signals were generated
new.df$sigup <- Lag(new.df$sigup,1) # Note k=1 implies a move *forward*
The current setup above produces an output like this:
roll.z.score.n3 sigup
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 0
6 -1.148753543 0
7 -0.942160394 1
8 -0.695763683 0
9 1.140646864 0
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 0
15 1.091089451 1
The entry signal is to go long when the z-score value is < -1, which is shown in row 3. We have a +1 on row 4 because we used Lag() to step the entry signal forward to the next day. Each time the z-score value is below -1, there is a +1 the next day.
This setup is perfectly fine if I'm only trading with a one-day holding period.
I can then multiply sigup (the 1's) by the % daily returns to obtain an equity curve.
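Not part of the question as such, but since the equity curve is mentioned, here is a minimal sketch of that multiplication step, assuming the column names from the snippet above (strat_ret and equity are my own, made-up names):
new.df$strat_ret <- new.df$sigup * new.df$clret  # keep the daily return only on days a position is held
new.df$strat_ret[is.na(new.df$strat_ret)] <- 0   # treat days with an NA signal as flat
# ROC() returns continuously compounded returns by default, so the equity curve is the exponentiated cumulative sum
new.df$equity <- exp(cumsum(new.df$strat_ret))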
I want to elaborate further on the entry / exit signals. I wish to go long (sig long) when zscore is <-1 and exit when z-score is >1.
The output would look something like this:
roll.z.score.n3 sig long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
Row 3 shows a z-score signal of < -1. Lagging to the next day makes it +1 (row 4), and it stays +1 all the way until row 9, when the z-score signal is > 1.0. Thus, the next day at row 10, the signal is 0.
I wanted to give some background on the current code; it's an attempt to extend the post on the FOSS Trading blog.
Thanks for taking a look at this.
See if the following works:
zz = '
roll.z.score.n3 sig_long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
'
df <- read.table(text = zz, header = TRUE)
df = na.omit(df)
df$sig_long[[1]] <- ifelse(df$roll.z.score.n3[[1]] < -1, 1, 0)
for (i in 2:nrow(df)) {
  df$sig_long[i] <- ifelse(df$roll.z.score.n3[i] < -1, 1,
                      ifelse(df$roll.z.score.n3[i] > 1, 0,
                             df$sig_long[i - 1]))
}
Not sure about this part:
df$sig_long <- Lag(df$sig_long, 1)

Increment call vector on itself in R - Finding counts in between values

I have a time series (or simply a vector) that is binary, returning 0's or 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (in this case the series is daily, so days) in between the 1's.
I can do this very easily in Excel, by simply referencing the column I am trying to calculate and adding the row above (if working with ascending data, or the row below if working with descending data). See below.
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides making a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(fwiw, the Dates in this example are made up, in the real data I am using business days so a bit different, and I dont want to reference them, just put in for clarity)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from How to reset cumsum at end of consecutive string and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0, nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)
df_date$DaysBetweenCondition1 <- sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
R is very good when working with rows that don't depend on each other. Therefore a lot of functions are vectorized. When working with functions that depend on the value of other rows it is not so easy.
At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.
# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0, nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)
# loop over rows
for (i in 1:nrow(df_date)) {
  if (is.na(df_date$Condition1[i])) {
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0 & is.na(df_date$Condition1[i - 1])) {
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0) {
    df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i - 1] + 1
  } else {
    df_date$DaysBetweenCondition1[i] <- 0
  }
}
Here's a solution that should be relatively fast
f0 <- function(x) {
  y <- x                          # template for return value
  isna <- is.na(x)                # used a couple of times
  grp <- cumsum(x[!isna])         # use '1' to mark start of each group
  lag <- lapply(tabulate(grp + 1), function(len) {
    seq(0, length.out = len)      # sequence from 0 to len-1
  })
  split(y[!isna], grp) <- lag     # split y, set to lag element, unsplit
  data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how the vector along x should be corrected based on the start value of the original vector
f1 <- function(x0) {
  y0 <- x0
  x <- x0[!is.na(x0)]
  y <- seq_along(x)
  offset <- rep(c(1, y[x == 1]), tabulate(cumsum(x) + 1))
  y0[!is.na(y0)] <- y - offset
  data.frame(x0, y0)  # return y0 (same length as x0), not the shorter y
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
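To make the offset concrete, here are the intermediate values for the non-NA part of the sample x shown above:
x <- c(0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,0)  # x0 with the NAs dropped
y <- seq_along(x)
starts <- c(1, y[x == 1])          # 1  4  5  8 11 16 20 21 24
sizes  <- tabulate(cumsum(x) + 1)  # 3  1  3  3  5  4  1  3  7
offset <- rep(starts, sizes)       # each start repeated over its group
y - offset                         # 0 1 2 0 0 1 2 0 1 2 0 1 2 3 4 0 1 2 3 0 0 1 2 0 1 2 3 4 5 6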
