Assign a date to specific text in R

I have a data frame in R, called Ident, as follows:
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0
I would like to assign the date immediately after each Bull.Engulfing to a variable: the first to DateSelect1, the second to DateSelect2, and so on, so that every Bull.Engulfing has a date assigned to it.
So in this example, since there is a Bull.Engulfing on 2017-09-14 (row 12), DateSelect1 should be 2017-09-15, as that is the next row. Hope this makes sense.
TIA

Assuming the input in reproducible form shown in the Note below, subset the data frame and assign each date in that subset to a variable:
dates <- as.Date(subset(DF, coredata.Ident. == "Bull.Engulfing")$Date)
for(i in seq_along(dates)) assign(paste0("DateSelect", i), dates[i])
DateSelect1
## [1] "2017-09-14"
DateSelect2
## [1] "2017-09-21"
Note: The input in reproducible form is:
Lines <- "
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0"
DF <- read.table(text = Lines)

Related

Define periods/episodes of exposure with overlapping and concatenated intervals of time

I'm trying to identify periods/episodes of exposure to a drug from prescription records. If two consecutive prescriptions are separated by more than 30 days, the later one starts a new period/episode of exposure; if the gap is 30 days or less, it is not a new episode. Prescriptions can overlap for some time or be consecutive.
I have data like this:
library(data.table)

id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the previous end (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows where an overlap happens (negative values) and where it doesn't (positive values). I think an ifelse/fcase statement here would be a bad idea, and I'm not comfortable writing one.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) since the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then do cumsum(noexp_days > 30) + 1.
If there is a better solution I'm not seeing, or any other possibility I haven't considered, I'd appreciate reading about it.
Thanks in advance for any help! :)
Try:
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]   # an overlap (negative gap) counts as 0 days without exposure
DT[, period := cumsum(noexp_days > 30) + 1, id]           # start a new period whenever the gap exceeds 30 days, per id
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
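Since the question also asks about alternative approaches, the same logic can be written with dplyr (a sketch, assuming last_diffdays has been computed as above; the result is a tibble rather than a data.table):
library(dplyr)

DT %>%
  group_by(id) %>%
  mutate(noexp_days = pmax(as.integer(last_diffdays), 0),
         period     = cumsum(noexp_days > 30) + 1) %>%
  ungroup()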

Growth Rate for daily data

I have sales data for some product and I would like to calculate the growth rate of these data, where N_win and N_lose are the wins and losses over the period 1-19 March. I would also like to forecast the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it doesn't seem to work:
Growthrate = function(x1, x2, n){
  gr = (x2/x1)^(1/n) - 1
  return(gr)
}
GR = NULL
for(i in 1:length(DF[,1])){
  GR[i] = Growthrate(DF[i,2], DF[i+1,2], sum(i))
}
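Two likely reasons the loop fails: on the last iteration DF[i+1, 2] indexes past the final row, and several rows have N_win equal to 0, so x2/x1 divides by zero. A minimal sketch of a corrected loop, assuming the growth rate is meant to be computed on the N_win column (column 2), which the question does not state explicitly:
Growthrate <- function(x1, x2, n){
  (x2/x1)^(1/n) - 1
}

n  <- nrow(DF)
GR <- rep(NA_real_, n)
for(i in seq_len(n - 1)){          # stop before the last row so DF[i + 1, ] exists
  if(DF$N_win[i] > 0){             # skip rows where the base value is zero
    GR[i] <- Growthrate(DF$N_win[i], DF$N_win[i + 1], i)
  }
}
GR
Forecasting future wins and losses would additionally require fitting a model to the series, which the posted function does not attempt.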

R code to get max count of time series data by group

I'd like to get a summary of time series data where the grouping is by runs of the "Flare" value and the maximum FlareLength within each run is the value of interest.
If I have a dataframe, like this:
Date Flare FlareLength
1 2015-12-01 0 1
2 2015-12-02 0 2
3 2015-12-03 0 3
4 2015-12-04 0 4
5 2015-12-05 0 5
6 2015-12-06 0 6
7 2015-12-07 1 1
8 2015-12-08 1 2
9 2015-12-09 1 3
10 2015-12-10 1 4
11 2015-12-11 0 1
12 2015-12-12 0 2
13 2015-12-13 0 3
14 2015-12-14 0 4
15 2015-12-15 0 5
16 2015-12-16 0 6
17 2015-12-17 0 7
18 2015-12-18 0 8
19 2015-12-19 0 9
20 2015-12-20 0 10
21 2015-12-21 0 11
22 2016-01-11 1 1
23 2016-01-12 1 2
24 2016-01-13 1 3
25 2016-01-14 1 4
26 2016-01-15 1 5
27 2016-01-16 1 6
28 2016-01-17 1 7
29 2016-01-18 1 8
I'd like output like:
Date Flare FlareLength
1 2015-12-06 0 6
2 2015-12-10 1 4
3 2015-12-21 0 11
4 2016-01-18 1 8
I have tried various aggregate forms but I'm not very familiar with the time series wrinkle.
Using dplyr, we can create a grouping variable by comparing each FlareLength with the previous FlareLength value and then select the row with the maximum FlareLength in each group.
library(dplyr)
df %>%
  group_by(gr = cumsum(FlareLength < lag(FlareLength,
                                         default = first(FlareLength)))) %>%
  slice(which.max(FlareLength)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 4 x 3
# Date Flare FlareLength
# <fct> <int> <int>
#1 2015-12-06 0 6
#2 2015-12-10 1 4
#3 2015-12-21 0 11
#4 2016-01-18 1 8
In base R with ave we can do the same as
subset(df, FlareLength == ave(FlareLength, cumsum(c(TRUE, diff(FlareLength) < 0)),
                              FUN = max))
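The same run construction also works in data.table (a sketch, assuming df is the data frame shown in the question):
library(data.table)

setDT(df)
df[, gr := cumsum(c(TRUE, diff(FlareLength) < 0))]           # new run whenever FlareLength drops
df[df[, .I[which.max(FlareLength)], by = gr]$V1][, !"gr"]    # keep the row with the maximum FlareLength per run, then drop gr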

How can I extract the data points with the highest information (gain) from a column?

Suppose I have this dataframe:
> df1
date count
1 2012-07-01 2.867133
2 2012-08-01 2.018745
3 2012-09-01 5.237515
4 2012-10-01 8.320493
5 2012-11-01 4.119850
6 2012-12-01 3.648649
7 2013-01-01 3.172867
8 2013-02-01 4.065041
9 2013-03-01 2.914798
10 2013-04-01 4.735683
11 2013-05-01 3.775411
12 2013-06-01 3.825717
13 2013-07-01 3.273427
14 2013-08-01 2.716469
15 2013-09-01 2.687296
16 2013-10-01 3.674121
17 2013-11-01 3.325942
18 2013-12-01 2.524038
I now want to split df1$count in such a way that I get groups/ranges where the information is highest. My thoughts go towards information gain, but I know IG is defined for attributes, not for a single column.
If you plot the data, you can clearly see a sharp rise and fall, so my goal is to find these significant increases/decreases, which carry the most information.
Any ideas on how I could do this?
Something like this?
library(dplyr)

df1 %>%
  mutate(dif = ifelse((lag(count) - count) > 0, 0, 1)) %>%
  mutate(group = rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count dif group
1 2012-07-01 2.867133 NA 1
2 2012-08-01 2.018745 0 2
3 2012-09-01 5.237515 1 3
4 2012-10-01 8.320493 1 3
5 2012-11-01 4.119850 0 4
6 2012-12-01 3.648649 0 4
7 2013-01-01 3.172867 0 4
8 2013-02-01 4.065041 1 5
9 2013-03-01 2.914798 0 6
10 2013-04-01 4.735683 1 7
11 2013-05-01 3.775411 0 8
12 2013-06-01 3.825717 1 9
13 2013-07-01 3.273427 0 10
14 2013-08-01 2.716469 0 10
15 2013-09-01 2.687296 0 10
16 2013-10-01 3.674121 1 11
17 2013-11-01 3.325942 0 12
18 2013-12-01 2.524038 0 12
UPDATE
df1 %>%
  mutate(nxt = lag(count),
         dif = ifelse(abs(count - lag(count)) > 2 | count/lag(count) > 3 | lag(count)/count > 3, 1, 0)) %>%
  mutate(group = rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count nxt dif group
1 2012-07-01 2.867133 NA NA 1
2 2012-08-01 2.018745 2.867133 0 2
3 2012-09-01 5.237515 2.018745 1 3
4 2012-10-01 8.320493 5.237515 1 3
5 2012-11-01 4.119850 8.320493 1 3
6 2012-12-01 3.648649 4.119850 0 4
7 2013-01-01 3.172867 3.648649 0 4
8 2013-02-01 4.065041 3.172867 0 4
9 2013-03-01 2.914798 4.065041 0 4
10 2013-04-01 4.735683 2.914798 0 4
11 2013-05-01 3.775411 4.735683 0 4
12 2013-06-01 3.825717 3.775411 0 4
13 2013-07-01 3.273427 3.825717 0 4
14 2013-08-01 2.716469 3.273427 0 4
15 2013-09-01 2.687296 2.716469 0 4
16 2013-10-01 3.674121 2.687296 0 4
17 2013-11-01 3.325942 3.674121 0 4
18 2013-12-01 2.524038 3.325942 0 4
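The rle step used in both blocks can also be written without magrittr (a sketch, assuming df1$dif has already been created as above):
r <- rle(df1$dif)
df1$group <- rep(seq_along(r$lengths), r$lengths)   # one group id per run of dif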

Cumsum reset at certain values [duplicate]

I have the following dataframe
x y count
1 1 2018-02-24 4.031540
2 2 2018-02-25 5.244303
3 3 2018-02-26 5.441465
4 NA 2018-02-27 4.164104
5 5 2018-02-28 5.172919
6 6 2018-03-01 5.591410
7 7 2018-03-02 4.691716
8 8 2018-03-03 5.465360
9 9 2018-03-04 3.269378
10 NA 2018-03-05 5.300679
11 11 2018-03-06 5.489664
12 12 2018-03-07 4.423334
13 13 2018-03-08 3.808764
14 14 2018-03-09 6.450136
15 15 2018-03-10 5.541785
16 16 2018-03-11 4.762889
17 17 2018-03-12 5.511649
18 18 2018-03-13 6.795386
19 19 2018-03-14 6.615762
20 20 2018-03-15 4.749151
I want to take the cumsum of the count column, but I want the cumsum to restart when the x value is NA. I've tried the following:
df$cum_sum <- ifelse(is.na(df$x) == FALSE, cumsum(df$count), 0)
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 0.000000
5 5 2018-02-28 5.172919 24.054331
6 6 2018-03-01 5.591410 29.645741
7 7 2018-03-02 4.691716 34.337458
8 8 2018-03-03 5.465360 39.802817
9 9 2018-03-04 3.269378 43.072195
10 NA 2018-03-05 5.300679 0.000000
11 11 2018-03-06 5.489664 53.862538
12 12 2018-03-07 4.423334 58.285871
13 13 2018-03-08 3.808764 62.094635
14 14 2018-03-09 6.450136 68.544771
15 15 2018-03-10 5.541785 74.086556
16 16 2018-03-11 4.762889 78.849445
17 17 2018-03-12 5.511649 84.361094
18 18 2018-03-13 6.795386 91.156480
19 19 2018-03-14 6.615762 97.772242
20 20 2018-03-15 4.749151 102.521394
The result is that the cum_sum column is 0 at the NA values, but the cumsum doesn't reset. How can I fix this?
A possible solution:
dat$cum_sum <- ave(dat$count, cumsum(is.na(dat$x)), FUN = cumsum)
which gives:
> dat
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 4.164104
5 5 2018-02-28 5.172919 9.337023
6 6 2018-03-01 5.591410 14.928433
7 7 2018-03-02 4.691716 19.620149
8 8 2018-03-03 5.465360 25.085509
9 9 2018-03-04 3.269378 28.354887
10 NA 2018-03-05 5.300679 5.300679
11 11 2018-03-06 5.489664 10.790343
12 12 2018-03-07 4.423334 15.213677
13 13 2018-03-08 3.808764 19.022441
14 14 2018-03-09 6.450136 25.472577
15 15 2018-03-10 5.541785 31.014362
16 16 2018-03-11 4.762889 35.777251
17 17 2018-03-12 5.511649 41.288900
18 18 2018-03-13 6.795386 48.084286
19 19 2018-03-14 6.615762 54.700048
20 20 2018-03-15 4.749151 59.449199
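The trick is that cumsum(is.na(dat$x)) builds a group index that increments at every NA, so ave restarts the cumulative sum there (a quick check against the data above):
cumsum(is.na(dat$x))
## [1] 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2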
Or with dplyr:
library(dplyr)
dat %>%
  group_by(grp = cumsum(is.na(x))) %>%
  mutate(cum_sum = cumsum(count)) %>%
  ungroup() %>%
  select(-grp)
Here is the data.table version:
library(data.table)

plouf <- setDT(df)
plouf[, group := cumsum(is.na(x))]
plouf[!is.na(x), cum_sum := cumsum(count), by = group]
x y count group cum_sum
1: 1 2018-02-24 4.031540 0 4.031540
2: 2 2018-02-25 5.244303 0 9.275843
3: 3 2018-02-26 5.441465 0 14.717308
4: NA 2018-02-27 4.164104 1 NA
5: 5 2018-02-28 5.172919 1 5.172919
6: 6 2018-03-01 5.591410 1 10.764329
7: 7 2018-03-02 4.691716 1 15.456045
8: 8 2018-03-03 5.465360 1 20.921405
9: 9 2018-03-04 3.269378 1 24.190783
10: NA 2018-03-05 5.300679 2 NA
11: 11 2018-03-06 5.489664 2 5.489664
12: 12 2018-03-07 4.423334 2 9.912998
13: 13 2018-03-08 3.808764 2 13.721762
14: 14 2018-03-09 6.450136 2 20.171898
15: 15 2018-03-10 5.541785 2 25.713683
16: 16 2018-03-11 4.762889 2 30.476572
17: 17 2018-03-12 5.511649 2 35.988221
18: 18 2018-03-13 6.795386 2 42.783607
19: 19 2018-03-14 6.615762 2 49.399369
20: 20 2018-03-15 4.749151 2 54.148520
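Note that because of the !is.na(x) filter this version leaves cum_sum as NA on the NA rows and does not include their count in the running total, unlike the ave and dplyr versions above. If that behaviour is wanted here too, one possible tweak (a sketch) is:
plouf[, cum_sum := cumsum(count), by = group]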
