R code to get max count of time series data by group

I'd like to get a summary of time series data where group is "Flare" and the max value of the FlareLength is the data of interest for that group.
If I have a dataframe, like this:
Date Flare FlareLength
1 2015-12-01 0 1
2 2015-12-02 0 2
3 2015-12-03 0 3
4 2015-12-04 0 4
5 2015-12-05 0 5
6 2015-12-06 0 6
7 2015-12-07 1 1
8 2015-12-08 1 2
9 2015-12-09 1 3
10 2015-12-10 1 4
11 2015-12-11 0 1
12 2015-12-12 0 2
13 2015-12-13 0 3
14 2015-12-14 0 4
15 2015-12-15 0 5
16 2015-12-16 0 6
17 2015-12-17 0 7
18 2015-12-18 0 8
19 2015-12-19 0 9
20 2015-12-20 0 10
21 2015-12-21 0 11
22 2016-01-11 1 1
23 2016-01-12 1 2
24 2016-01-13 1 3
25 2016-01-14 1 4
26 2016-01-15 1 5
27 2016-01-16 1 6
28 2016-01-17 1 7
29 2016-01-18 1 8
I'd like output like:
Date Flare FlareLength
1 2015-12-06 0 6
2 2015-12-10 1 4
3 2015-12-21 0 11
4 2016-01-18 1 8
I have tried various aggregate forms but I'm not very familiar with the time series wrinkle.

Using dplyr, we can create a grouping variable by comparing the FlareLength with the previous FlareLength value and select the row with maximum FlareLength in the group.
library(dplyr)
df %>%
  group_by(gr = cumsum(FlareLength < lag(FlareLength,
                                         default = first(FlareLength)))) %>%
  slice(which.max(FlareLength)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 4 x 3
# Date Flare FlareLength
# <fct> <int> <int>
#1 2015-12-06 0 6
#2 2015-12-10 1 4
#3 2015-12-21 0 11
#4 2016-01-18 1 8
In base R with ave we can do the same as
subset(df, FlareLength == ave(FlareLength, cumsum(c(TRUE, diff(FlareLength) < 0)),
FUN = max))
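The grouping trick used in both answers can be checked on a small made-up data frame. The sketch below is only an illustration: the toy values and the names run_id and run_by_flare are assumptions, not part of the original answers. It also builds an alternative run id from changes in Flare, which may be safer if two consecutive runs could each have length 1 (FlareLength would then go 1, 1 and never decrease).

```r
# Toy data (assumed values): FlareLength restarts at 1 when a new run begins.
df <- data.frame(
  Date = as.Date("2015-12-01") + 0:9,
  Flare = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
  FlareLength = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1)
)

# Run id as in the answers: a new run starts wherever FlareLength drops.
run_id <- cumsum(c(TRUE, diff(df$FlareLength) < 0))

# Alternative run id (assumption): a new run starts wherever Flare changes.
run_by_flare <- cumsum(c(TRUE, diff(df$Flare) != 0))

# Keep the rows whose FlareLength equals the maximum of their run.
result <- df[df$FlareLength == ave(df$FlareLength, run_id, FUN = max), ]
print(result)
```

On this toy input the two run ids agree, so either could be passed to ave.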

Related

Group two dfs based on dates that closely match

These are subsets of two dataframes.
df1:
plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245
(cont.)
df2:
plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17
(cont.)
df2 has some dates that match df1, but sometimes the dates are off by one or a few days.
I would like to combine both dfs based on both 'date' and 'plot', keeping df2, without losing the 'gdd' data from df1.
An inner_join, for example, would lose that data because the dates do not match exactly.
So if a date in df1 is one to three days earlier or later than the closest possible match in df2, that is fine because the dates are relatively close. This is tricky because I want this data replacement only if there is no data available in df1 for that date range.
My goal is to have something like this:
plot date flowers gdd
1 2019-07-12 2 60
1 2019-07-13 9 60
1 2019-07-14 3 60
1 2019-07-15 3 60
2 2019-05-12 10 100
2 2019-05-13 10 100
2 2019-05-14 14 100
2 2019-05-15 17 100
Is it possible to do?
I greatly appreciate any help!
Thanks!
I think a 'rolling join' from the data.table package can handle this:
library(data.table)
setDT(df1)
setDT(df2)
df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]
df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]
# plot mean_first_flower_date gdd flowers
#1: 1 2019-07-12 60 2
#2: 1 2019-07-13 60 9
#3: 1 2019-07-14 60 3
#4: 1 2019-07-15 60 3
#5: 2 2019-05-12 100 10
#6: 2 2019-05-13 100 10
#7: 2 2019-05-14 100 14
#8: 2 2019-05-15 100 17
Using this data:
df1 <- read.table(text="plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245", header=TRUE)
df2 <- read.table(text="plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17", header=TRUE)
Try fill from tidyr, together with left_join from dplyr. Use this syntax:
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 NA
8 2 2019-05-15 17 NA
Notice the two NAs in the last two rows. They should not appear when you join your actual df2, where these rows will be filled with 173 because there will be a match for 2019-05-22. If you still want to fill any trailing NA rows, you can use fill again with .direction = "down":
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 100
8 2 2019-05-15 17 100
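For comparison, the "closest date within three days" idea from the question can also be sketched in base R without any join. This is not the rolling join or the fill approach above, just a plain nearest-match lookup; the name max_gap and the choice to take the first candidate on ties are assumptions made for illustration.

```r
df1 <- data.frame(
  plot = c(1, 1, 1, 2, 2, 2),
  mean_first_flower_date = as.Date(c("2019-07-15", "2019-07-21", "2019-07-23",
                                     "2019-05-13", "2019-05-22", "2019-05-25")),
  gdd = c(60, 50, 78, 100, 173, 245)
)
df2 <- data.frame(
  plot = c(1, 1, 1, 1, 2, 2, 2, 2),
  date = as.Date(c("2019-07-12", "2019-07-13", "2019-07-14", "2019-07-15",
                   "2019-05-12", "2019-05-13", "2019-05-14", "2019-05-15")),
  flowers = c(2, 9, 3, 3, 10, 10, 14, 17)
)

max_gap <- 3  # largest accepted distance in days (assumed from the question)

# For each df2 row, take gdd from the df1 row of the same plot whose date is
# nearest, or NA if no df1 date lies within max_gap days.
df2$gdd <- vapply(seq_len(nrow(df2)), function(i) {
  cand <- df1[df1$plot == df2$plot[i], ]
  gap <- abs(as.numeric(cand$mean_first_flower_date - df2$date[i]))
  if (length(gap) == 0 || min(gap) > max_gap) NA_real_ else cand$gdd[which.min(gap)]
}, numeric(1))

print(df2)
```

Because the lookup is done per row and per plot, a value from one plot can never leak into another, which an ungrouped fill does not guarantee.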

How to print a date when the input is number of days since 01-01-60?

I received a set of dates, but it turns out that time is reported in days since 01-01-1960 in this specific data set.
D_INDDTO
1 20758
2 20856
3 21062
4 19740
5 21222
6 21203
The specific date of interest for Patient 1 is 20758 days since 01-01-60.
I want to create a new covariate u$date containing the specific date of interest in %d%m%y format. I tried
library(tidyverse)
u %>% mutate(date=as.date(D_INDDTO,origin="1960-01-01")
But that did not solve it.
u <- structure(list(D_INDDTO = c(20758, 20856, 21062, 19740, 21222,
21203, 20976, 20895, 18656, 18746)), row.names = c(NA, 10L), class = "data.frame")
Try this:
#Code 1
u %>% mutate(date=as.Date("1960-01-01")+D_INDDTO)
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 2
u %>% mutate(date=as.Date(D_INDDTO,origin="1960-01-01"))
Output:
D_INDDTO date
1 20758 2016-10-31
2 20856 2017-02-06
3 21062 2017-08-31
4 19740 2014-01-17
5 21222 2018-02-07
6 21203 2018-01-19
7 20976 2017-06-06
8 20895 2017-03-17
9 18656 2011-01-29
10 18746 2011-04-29
Or this:
#Code 3
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d%m%y'))
Output:
D_INDDTO date
1 20758 311016
2 20856 060217
3 21062 310817
4 19740 170114
5 21222 070218
6 21203 190118
7 20976 060617
8 20895 170317
9 18656 290111
10 18746 290411
If more customization is required:
#Code 4
u %>% mutate(date=format(as.Date(D_INDDTO,origin="1960-01-01"),'%d-%m-%Y'))
Output:
D_INDDTO date
1 20758 31-10-2016
2 20856 06-02-2017
3 21062 31-08-2017
4 19740 17-01-2014
5 21222 07-02-2018
6 21203 19-01-2018
7 20976 06-06-2017
8 20895 17-03-2017
9 18656 29-01-2011
10 18746 29-04-2011
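A quick way to sanity-check a conversion like this is to round-trip it: convert the day counts to dates, then subtract the origin again and confirm the original numbers come back. The sketch below uses only base R; the variable names days, dates and back are made up for illustration.

```r
days <- c(20758, 20856, 21062)  # first three values of D_INDDTO

# Day counts are interpreted relative to the stated origin, 1960-01-01.
dates <- as.Date(days, origin = "1960-01-01")
print(format(dates, "%d-%m-%Y"))

# Round trip: subtracting the origin should recover the original day counts.
back <- as.numeric(dates - as.Date("1960-01-01"))
print(back)
```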

Growth Rate for daily data

I have daily sales data for a product and would like to calculate its growth rate, where N_win and N_lose are the wins and losses over the period 1-19 March. I would also like to predict the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it does not seem to work:
Growthrate = function(x1,x2, n){
gr = (x2/x1)^(1/n)-1
return(gr)
}
GR = NULL
for(i in 1:length(DF[,1])){
GR[i] = Growthrate(DF[i,2],DF[i+1,2], sum(i))
}
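Two things in the loop above are likely to cause trouble: on the last iteration DF[i+1,2] reads past the final row and returns NA, and whenever the earlier value is 0 the ratio x2/x1 is Inf or NaN. A minimal vectorised sketch that avoids both is shown below. It is only one possible reading of the question: it assumes the growth rate is wanted day over day (n = 1) on the cumulative column tot1, and the names growth_rate, prev, nxt and gr are illustrative.

```r
# Same formula as in the question, with n defaulting to one period.
growth_rate <- function(x1, x2, n = 1) (x2 / x1)^(1 / n) - 1

# First ten values of tot1 from the table above.
tot1 <- c(0, 1, 1, 2, 5, 5, 7, 11, 15, 20)

prev <- head(tot1, -1)  # value on day i
nxt  <- tail(tot1, -1)  # value on day i + 1

# The growth rate is undefined when the previous value is 0, so return NA there.
gr <- ifelse(prev > 0, growth_rate(prev, nxt), NA_real_)
print(round(gr, 3))
```

The result has one element fewer than tot1, because the first day has no earlier value to compare against.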

How can I receive from a column data points with the highest Information (Gain)?

Suppose I have this dataframe:
> df1
date count
1 2012-07-01 2.867133
2 2012-08-01 2.018745
3 2012-09-01 5.237515
4 2012-10-01 8.320493
5 2012-11-01 4.119850
6 2012-12-01 3.648649
7 2013-01-01 3.172867
8 2013-02-01 4.065041
9 2013-03-01 2.914798
10 2013-04-01 4.735683
11 2013-05-01 3.775411
12 2013-06-01 3.825717
13 2013-07-01 3.273427
14 2013-08-01 2.716469
15 2013-09-01 2.687296
16 2013-10-01 3.674121
17 2013-11-01 3.325942
18 2013-12-01 2.524038
I now want to split df1$count in such a way that I get groups/ranges where the information is highest. My thoughts go towards Information Gain, but I know IG is for attributes, not a column.
If you plot the data, you can distinguish a high rise and decrease...so my goal is to always find these significant increases/decreases which contain a high Information Gain.
Any ideas on how I could do this?
Something like this?
df1%>%
mutate(dif=ifelse((lag(count)-count)>0,0,1))%>%
mutate(group=rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count dif group
1 2012-07-01 2.867133 NA 1
2 2012-08-01 2.018745 0 2
3 2012-09-01 5.237515 1 3
4 2012-10-01 8.320493 1 3
5 2012-11-01 4.119850 0 4
6 2012-12-01 3.648649 0 4
7 2013-01-01 3.172867 0 4
8 2013-02-01 4.065041 1 5
9 2013-03-01 2.914798 0 6
10 2013-04-01 4.735683 1 7
11 2013-05-01 3.775411 0 8
12 2013-06-01 3.825717 1 9
13 2013-07-01 3.273427 0 10
14 2013-08-01 2.716469 0 10
15 2013-09-01 2.687296 0 10
16 2013-10-01 3.674121 1 11
17 2013-11-01 3.325942 0 12
18 2013-12-01 2.524038 0 12
UPDATE
df1%>%
  mutate(nxt=lag(count),
         dif=ifelse( abs(count-lag(count))>2 | count/lag(count)>3 | lag(count)/count>3,1,0))%>%
  mutate(group=rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count nxt dif group
1 2012-07-01 2.867133 NA NA 1
2 2012-08-01 2.018745 2.867133 0 2
3 2012-09-01 5.237515 2.018745 1 3
4 2012-10-01 8.320493 5.237515 1 3
5 2012-11-01 4.119850 8.320493 1 3
6 2012-12-01 3.648649 4.119850 0 4
7 2013-01-01 3.172867 3.648649 0 4
8 2013-02-01 4.065041 3.172867 0 4
9 2013-03-01 2.914798 4.065041 0 4
10 2013-04-01 4.735683 2.914798 0 4
11 2013-05-01 3.775411 4.735683 0 4
12 2013-06-01 3.825717 3.775411 0 4
13 2013-07-01 3.273427 3.825717 0 4
14 2013-08-01 2.716469 3.273427 0 4
15 2013-09-01 2.687296 2.716469 0 4
16 2013-10-01 3.674121 2.687296 0 4
17 2013-11-01 3.325942 3.674121 0 4
18 2013-12-01 2.524038 3.325942 0 4
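The rle() step in the snippets above is what turns the 0/1 indicator into consecutive group numbers. The same step can be sketched in base R, without dplyr or magrittr, on the first six values of count. The variable names below are illustrative; the rule for dif mirrors the first snippet (1 when the value did not decrease, 0 when it did).

```r
count <- c(2.867133, 2.018745, 5.237515, 8.320493, 4.119850, 3.648649)

# 1 if the value rose (or stayed equal) versus the previous one, 0 if it fell;
# the first element has no predecessor, so it is NA.
dif <- c(NA, as.integer(diff(count) >= 0))

# rle() gives the length of each run of identical values (an NA forms its own
# run); repeating each run's index by its length yields one id per row.
runs <- rle(dif)
group <- rep(seq_along(runs$lengths), runs$lengths)

print(data.frame(count, dif, group))
```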

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here I use the data.table approach: convert the 'data.frame' to a 'data.table' (setDT(df)), then, grouped by 'event_type', take the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can also use shift (available from v1.9.5 onwards) to get the lag (by default, type='lag') of 'date' and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
do.call(rbind,
  lapply(
    split(df, df$event_type),
    function(d) {
      d$dsle <- c(NA, diff(d$date)); d
    }
  )
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
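If the original row order should be kept without adding a package, base R's ave can apply the same "NA, then diff" logic per event type and write the results back in place. This is a sketch alongside the answers above rather than a replacement for them; it converts the dates to plain numbers first so that ave returns an ordinary numeric column.

```r
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/04/2001", "26/05/2001", "01/06/2001",
                   "30/06/2001", "02/07/2001", "15/07/2001", "21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3)
)

# Within each event_type: NA for the first occurrence, then the gap in days
# to the previous occurrence. ave() returns values in the original row order.
df$days_since_last_event <- ave(
  as.numeric(df$date), df$event_type,
  FUN = function(d) c(NA, diff(d))
)

print(df)
```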
