Cumsum reset at certain values [duplicate] - r

I have the following dataframe
x y count
1 1 2018-02-24 4.031540
2 2 2018-02-25 5.244303
3 3 2018-02-26 5.441465
4 NA 2018-02-27 4.164104
5 5 2018-02-28 5.172919
6 6 2018-03-01 5.591410
7 7 2018-03-02 4.691716
8 8 2018-03-03 5.465360
9 9 2018-03-04 3.269378
10 NA 2018-03-05 5.300679
11 11 2018-03-06 5.489664
12 12 2018-03-07 4.423334
13 13 2018-03-08 3.808764
14 14 2018-03-09 6.450136
15 15 2018-03-10 5.541785
16 16 2018-03-11 4.762889
17 17 2018-03-12 5.511649
18 18 2018-03-13 6.795386
19 19 2018-03-14 6.615762
20 20 2018-03-15 4.749151
I want to take the cumsum of the count column, but I want the cumsum to restart whenever the x value is NA. I've tried the following:
df$cum_sum <- ifelse(is.na(df$x) == FALSE, cumsum(df$count), 0)
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 0.000000
5 5 2018-02-28 5.172919 24.054331
6 6 2018-03-01 5.591410 29.645741
7 7 2018-03-02 4.691716 34.337458
8 8 2018-03-03 5.465360 39.802817
9 9 2018-03-04 3.269378 43.072195
10 NA 2018-03-05 5.300679 0.000000
11 11 2018-03-06 5.489664 53.862538
12 12 2018-03-07 4.423334 58.285871
13 13 2018-03-08 3.808764 62.094635
14 14 2018-03-09 6.450136 68.544771
15 15 2018-03-10 5.541785 74.086556
16 16 2018-03-11 4.762889 78.849445
17 17 2018-03-12 5.511649 84.361094
18 18 2018-03-13 6.795386 91.156480
19 19 2018-03-14 6.615762 97.772242
20 20 2018-03-15 4.749151 102.521394
The result is that the cum_sum column is 0 at the NA rows, but the running total does not reset afterwards. How can I fix this?

A possible solution:
dat$cum_sum <- ave(dat$count, cumsum(is.na(dat$x)), FUN = cumsum)
which gives:
> dat
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 4.164104
5 5 2018-02-28 5.172919 9.337023
6 6 2018-03-01 5.591410 14.928433
7 7 2018-03-02 4.691716 19.620149
8 8 2018-03-03 5.465360 25.085509
9 9 2018-03-04 3.269378 28.354887
10 NA 2018-03-05 5.300679 5.300679
11 11 2018-03-06 5.489664 10.790343
12 12 2018-03-07 4.423334 15.213677
13 13 2018-03-08 3.808764 19.022441
14 14 2018-03-09 6.450136 25.472577
15 15 2018-03-10 5.541785 31.014362
16 16 2018-03-11 4.762889 35.777251
17 17 2018-03-12 5.511649 41.288900
18 18 2018-03-13 6.795386 48.084286
19 19 2018-03-14 6.615762 54.700048
20 20 2018-03-15 4.749151 59.449199
Or with dplyr:
library(dplyr)
dat %>%
group_by(grp = cumsum(is.na(x))) %>%
mutate(cum_sum = cumsum(count)) %>%
ungroup() %>%
select(-grp)
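Both approaches rely on the same grouping trick: cumsum(is.na(x)) increases by one at every NA, so each NA row opens a new group, and the cumulative sum is then taken within each group. A minimal sketch on made-up toy vectors (the values and the names grp and running are illustrative assumptions, not the asker's data):

```r
# Toy input: NA in x marks the rows where the running total should restart
x     <- c(1, 2, NA, 4, 5, NA, 7)
count <- c(1, 2, 3, 4, 5, 6, 7)

# Group index: steps up by one at each NA
grp <- cumsum(is.na(x))
grp
# 0 0 1 1 1 2 2

# Cumulative sum computed separately within each group
running <- ave(count, grp, FUN = cumsum)
running
# 1 3 3 7 12 6 13
```

The original ifelse() attempt fails because cumsum(df$count) is computed once over the whole column; ifelse() only masks individual entries and never restarts the sum.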

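Here is a data.table version. Rows where x is NA are excluded from the sum, so cum_sum is NA on those rows and the total restarts on the next row:
library(data.table)
plouf <- setDT(df)
plouf[, group := cumsum(is.na(x))]
plouf[!is.na(x), cum_sum := cumsum(count), by = group]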
x y count group cum_sum
1: 1 2018-02-24 4.031540 0 4.031540
2: 2 2018-02-25 5.244303 0 9.275843
3: 3 2018-02-26 5.441465 0 14.717308
4: NA 2018-02-27 4.164104 1 NA
5: 5 2018-02-28 5.172919 1 5.172919
6: 6 2018-03-01 5.591410 1 10.764329
7: 7 2018-03-02 4.691716 1 15.456045
8: 8 2018-03-03 5.465360 1 20.921405
9: 9 2018-03-04 3.269378 1 24.190783
10: NA 2018-03-05 5.300679 2 NA
11: 11 2018-03-06 5.489664 2 5.489664
12: 12 2018-03-07 4.423334 2 9.912998
13: 13 2018-03-08 3.808764 2 13.721762
14: 14 2018-03-09 6.450136 2 20.171898
15: 15 2018-03-10 5.541785 2 25.713683
16: 16 2018-03-11 4.762889 2 30.476572
17: 17 2018-03-12 5.511649 2 35.988221
18: 18 2018-03-13 6.795386 2 42.783607
19: 19 2018-03-14 6.615762 2 49.399369
20: 20 2018-03-15 4.749151 2 54.148520

Related

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table( id = seq(1,21),
type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
"2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
"2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
"2020-05-28", "2020-08-25", "2020-12-14")),
end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
"2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
"2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
"2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("") +
theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14, however, intervention type 2 is also happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if possible, I would like to reorganise the data so that all the separate time periods are extracted, meaning the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-13
2: 2 3 2015-04-14 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers that use dplyr or that only mark overlapping time periods, but I have not found an answer to my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion of looking at all dates in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr)  # provides the %>% pipe used below
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
.[, .N, by = .(start_dt, type) ] %>%
.[ !is.na(type), ] %>%
dcast(start_dt ~ type, value.var = "N") %>%
.[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt") ] %>%
.[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
start_dt = min(start_dt), end_dt = max(start_dt)), by = r ]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
This drops the id field; I don't know a good way to map it back to your original data.
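To get from a table of periods like the one above to the totals the question asks for, the day counts can be summed per type. A minimal base R sketch using only the first four rows of that output (the data frame name res and the days column are assumptions introduced here); both endpoints are counted, as in the question's own arithmetic:

```r
# First four periods copied from the output above
res <- data.frame(
  type     = c("1", "3", "1", "3"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-05-05", "2015-06-19")),
  end_dt   = as.Date(c("2015-04-13", "2015-05-04", "2015-06-18", "2015-08-27"))
)

# Inclusive length of each period in days
res$days <- as.integer(res$end_dt - res$start_dt) + 1L
res$days
# 95 21 45 70

# Total days per type (type "3" is the overlap)
totals <- tapply(res$days, res$type, sum)
totals
#   1   3
# 140  91
```

The 140 days of "pure" type 1 match the figure worked out by hand in the question.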
@r2evans's solution is more complete, but if you want to explore the use of foverlaps, you can start with something like this:
#split into two frames
data = split(data,by="type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]],data[[2]], type="any", nomatch=0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt,i.start_dt), end_dt=min(end_dt,i.end_dt)), by=1:nrow(overlap)][,type:=3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.
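The figure of 1492 appears to correspond to summing end_dt - start_dt, which does not count the last day of each interval. A short sketch, using the 18 overlap intervals copied from the output above, shows both conventions; counting both endpoints (as the question does when it calls 2015-04-14 to 2015-05-04 "21 days") adds one day per interval:

```r
# Overlap intervals copied from the foverlaps output above
start_dt <- as.Date(c(
  "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24", "2016-08-03",
  "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09", "2019-01-31", "2019-03-22",
  "2019-05-16", "2019-11-04", "2020-02-06", "2020-05-28", "2020-08-25", "2020-12-14"
))
end_dt <- as.Date(c(
  "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09", "2017-07-18",
  "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15", "2019-02-21", "2019-04-24",
  "2019-09-13", "2020-01-26", "2020-04-29", "2020-08-19", "2020-11-16", "2020-12-23"
))

# Plain date difference: the end day itself is not counted
gap <- as.integer(end_dt - start_dt)
sum(gap)
# 1492

# Inclusive count: both the start day and the end day are counted
sum(gap + 1L)
# 1510
```

Which convention is appropriate depends on whether end_dt is meant to be the last day of the intervention or the first day after it.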

Define periods/episodes of exposition with overlapping and concatenated intervals of time

I'm trying to identify periods/episodes of exposure to a drug from prescription records. If two prescriptions are separated by more than 30 days, the later one starts a new period/episode of exposure. Prescriptions can overlap for some time or be consecutive. Gaps are judged one at a time: if two successive gaps are each 30 days or less but together add up to more than 30 days, that is not considered a new episode.
I have data like this:
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the end of the previous observation (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase statement here would be a bad idea and I'm not comfortable doing it.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) since the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then compute cumsum(noexp_days > 30) + 1.
If there is a much better solution I don't see or any other possibility I haven't considered I will appreciate to read about them.
Thanks in advance for any help! :)
Try :
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]
DT[, period := cumsum(noexp_days > 30) + 1, id]
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
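The period counter works because cumsum() over a logical vector counts how many times the condition has been TRUE so far. A minimal sketch on the noexp_days values of id 3 from the table above, in base R without the by-id grouping:

```r
# Gap (in days) before each prescription of id 3, copied from the output above
noexp_days <- c(0, 16, 15, 32, 61)

# Each gap longer than 30 days bumps the period counter by one
period <- cumsum(noexp_days > 30) + 1
period
# 1 1 1 2 3
```

The gaps of 16 and 15 days add up to 31, but because each gap is tested on its own, neither of them starts a new period.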

What is the best way to use dplyr/dtplyr to make complex updates to a data.table

We have written a package to analyse a large number of events in relation to time windows.
To do the analysis we need to establish a number of attributes of the windows and cross-references
between them.
This has been done using data.table in its native syntax. Examples of some of the steps are included in the reprex below.
We are now looking to re-frame this package using dplyr/dtplyr for readability and sharing with other
parties.
While I can write the 'queries' in dplyr syntax, I am not seeing a tidyverse way to apply updates to the underlying tables - adding columns, updating rows etc. without repeatedly creating and replacing copies.
When the data is large, the 'update in place' features of data.table are very desirable. Is there a way to take advantage of this in the dplyr syntax? (I have hit barriers with immutable = FALSE and attempts to use rows_update())
library(data.table)
set.seed(123)  # seed the RNG; the printed values below came from an unseeded run and will differ
#Create a table of events with timestamp and an event type (501 events randomly generated over the previous 30 days)
DT1 <- data.table(timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501)*60*1440*30),
eventType=c('A', 'B', 'C'))
setkey(DT1, timeStamp)
print(DT1)
#> timeStamp eventType
#> 1: 2021-02-23 00:42:37 A
#> 2: 2021-02-23 04:21:43 A
#> 3: 2021-02-23 05:23:51 C
#> 4: 2021-02-23 06:45:36 C
#> 5: 2021-02-23 08:34:32 B
#> ---
#> 497: 2021-03-24 11:32:09 A
#> 498: 2021-03-24 13:49:53 B
#> 499: 2021-03-24 14:26:55 C
#> 500: 2021-03-24 18:11:33 C
#> 501: 2021-03-24 20:13:51 A
#Create a table of time windows. One for each date represented with an early and late time for each
#Assign this a class (in this example the value of the most common eventType)
DT2 <- DT1[,keyby=.(date=lubridate::date(timeStamp)),
.(earlyTime = min(timeStamp - 1),
lateTime = max(timeStamp + 1),
as = sum(eventType == 'A'),
bs = sum(eventType == 'B'),
cs = sum(eventType == 'C'))][
,.(date,
earlyTime,
lateTime,
class=ifelse(as >= bs & as >= cs, 'A', ifelse(bs >= cs, 'B', 'C')))]
print(head(DT2))
#> date earlyTime lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime), rn := .I]
#For each row, get the row number of the previous instance of this class
DT2[order(class, rn), prevOfClass := shift(rn, 1), by=.(class)]
print(head(DT2))
#> date earlyTime lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B 1 NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B 2 1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A 3 NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B 4 2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A 5 3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A 6 5
#For each row that is not a 'C' find the previous and next instances of a C type row
#Note that when we assigned rn we ensured that the rows were in ascending time order
#so rn can be used as a proxy for sorting by time
DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))]
#> rn rn nextC prevC
#> 1: 1 1 8 NA
#> 2: 2 2 8 NA
#> 3: 3 3 8 NA
#> 4: 4 4 8 NA
#> 5: 5 5 8 NA
#> 6: 6 6 8 NA
#> 7: 7 7 8 NA
#> 8: 9 9 13 8
#> 9: 10 10 13 8
#> 10: 11 11 13 8
#> 11: 12 12 13 8
#> 12: 14 14 16 13
#> 13: 15 15 16 13
#> 14: 17 17 26 16
#> 15: 18 18 26 16
#> 16: 19 19 26 16
#> 17: 20 20 26 16
#> 18: 21 21 26 16
#> 19: 22 22 26 16
#> 20: 23 23 26 16
#> 21: 24 24 26 16
#> 22: 25 25 26 16
#> 23: 28 28 30 27
#> 24: 29 29 30 27
#> rn rn nextC prevC
#But I want to add this information as additional columns to the base table
DT2[DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))],
on = .(rn),
':='(nextC=i.nextC, prevC = i.prevC)
]
print(DT2[,.(rn, date, class, prevOfClass, nextC, prevC)])
#> rn date class prevOfClass nextC prevC
#> 1: 1 2021-02-23 B NA 8 NA
#> 2: 2 2021-02-24 B 1 8 NA
#> 3: 3 2021-02-25 A NA 8 NA
#> 4: 4 2021-02-26 B 2 8 NA
#> 5: 5 2021-02-27 A 3 8 NA
#> 6: 6 2021-02-28 A 5 8 NA
#> 7: 7 2021-03-01 A 6 8 NA
#> 8: 8 2021-03-02 C NA NA NA
#> 9: 9 2021-03-03 A 7 13 8
#> 10: 10 2021-03-04 A 9 13 8
#> 11: 11 2021-03-05 B 4 13 8
#> 12: 12 2021-03-06 A 10 13 8
#> 13: 13 2021-03-07 C 8 NA NA
#> 14: 14 2021-03-08 A 12 16 13
#> 15: 15 2021-03-09 B 11 16 13
#> 16: 16 2021-03-10 C 13 NA NA
#> 17: 17 2021-03-11 A 14 26 16
#> 18: 18 2021-03-12 B 15 26 16
#> 19: 19 2021-03-13 A 17 26 16
#> 20: 20 2021-03-14 B 18 26 16
#> 21: 21 2021-03-15 A 19 26 16
#> 22: 22 2021-03-16 A 21 26 16
#> 23: 23 2021-03-17 A 22 26 16
#> 24: 24 2021-03-18 A 23 26 16
#> 25: 25 2021-03-19 B 20 26 16
#> 26: 26 2021-03-20 C 16 NA NA
#> 27: 27 2021-03-21 C 26 NA NA
#> 28: 28 2021-03-22 B 25 30 27
#> 29: 29 2021-03-23 A 24 30 27
#> 30: 30 2021-03-24 C 27 NA NA
#> rn date class prevOfClass nextC prevC
#What would be the best approach to this using dplyr / dtplyr syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and...
#There are many more update and enrichments that need to be applied
#some of which add new columns, others will update just a few rows
#in a column
#So 'mutate in place/by reference' is highly desirable
Created on 2021-03-25 by the reprex package (v1.0.0)

Growth Rate for daily data

I have daily sales data for a product, and I would like to calculate its growth rate, where N_win and N_lose are the wins and losses over the period 1-19 March. I would also like to predict the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it does not seem to work:
Growthrate = function(x1,x2, n){
gr = (x2/x1)^(1/n)-1
return(gr)
}
GR = NULL
for(i in 1:length(DF[,1])){
GR[i] = Growthrate(DF[i,2],DF[i+1,2], sum(i))
}
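One way to approach this, sketched under assumptions: the loop above reads one row past the end of the data (DF[i+1,2]) and divides by zero wherever the earlier value is 0. The sketch below instead works on the cumulative column tot1, computes a day-over-day rate, and leaves it as NA where the previous total is 0. The names prev, daily_growth, compound_growth and overall are introduced here for illustration, and the choice of tot1 as the series to analyse is an assumption:

```r
# Cumulative wins (tot1) copied from the table in the question
tot1 <- c(0, 1, 1, 2, 5, 5, 7, 11, 15, 20, 21, 45, 86, 103, 118, 133, 171, 238)

# Previous day's total, with NA for the first day
prev <- c(NA, head(tot1, -1))

# Day-over-day growth rate; NA where the previous total is missing or 0
daily_growth <- ifelse(!is.na(prev) & prev > 0, tot1 / prev - 1, NA_real_)
head(daily_growth, 5)
# NA NA 0.0 1.0 1.5

# Compound daily growth between two days, using the formula from the question
compound_growth <- function(x1, x2, n) (x2 / x1)^(1 / n) - 1

# From day 2 (first non-zero total) to day 18, i.e. over 16 daily steps
overall <- compound_growth(tot1[2], tot1[18], n = 16)
```

Forecasting would need a model on top of this (for example a regression on log(tot1)), which is beyond this sketch.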

Assign a date to specific text in R

I have a data frame in R as follows called Ident.
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0
I would like to take the date of the row following each Bull.Engulfing and assign it to a variable: the first to DateSelect1, the second to DateSelect2, and so on, so that every Bull.Engulfing has a date assigned to it.
So in this example, as there is a Bull.Engulfing on 2017-09-14 (row 12), DateSelect1 should be 2017-09-15, the date in the next row. Hope this makes sense.
TIA
Assuming the input shown in the Note at the end, subset the data frame and assign each date in that subset to a variable (this gives the date of each Bull.Engulfing row itself):
dates <- as.Date(subset(DF, coredata.Ident. == "Bull.Engulfing")$Date)
for(i in seq_along(dates)) assign(paste0("DateSelect", i), dates[i])
DateSelect1
## [1] "2017-09-14"
DateSelect2
## [1] "2017-09-21"
Note: The input in reproducible form is:
Lines <- "
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0"
DF <- read.table(text = Lines)
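If the date of the row after each Bull.Engulfing is wanted, as the question describes, one possible variation is to look up the matching row positions and shift them by one. A minimal sketch on a six-row excerpt of the data (the names idx and next_dates are assumptions; a named vector is used here instead of assign() so that the results stay together):

```r
# Excerpt of the input, enough to contain both Bull.Engulfing rows
DF <- data.frame(
  Date = c("2017-09-13", "2017-09-14", "2017-09-15",
           "2017-09-20", "2017-09-21", "2017-09-22"),
  coredata.Ident. = c("0", "Bull.Engulfing", "0",
                      "Bear.Engulfing", "Bull.Engulfing", "0"),
  stringsAsFactors = FALSE
)

# Row positions of the pattern
idx <- which(DF$coredata.Ident. == "Bull.Engulfing")

# A match on the last row has no following row, so drop it
idx <- idx[idx < nrow(DF)]

# Dates of the rows immediately after each match
next_dates <- as.Date(DF$Date[idx + 1])
names(next_dates) <- paste0("DateSelect", seq_along(next_dates))
next_dates
#  DateSelect1  DateSelect2
# "2017-09-15" "2017-09-22"
```

Individual values can then be read with next_dates[["DateSelect1"]].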
