Sequentially number groups based on a condition - R

I need some help with my R code - I've been trying to get it to work for ages and I'm totally stuck.
I have a large dataset (~40,000 rows) and I need to assign group IDs to a new column based on a condition in another column: if df$flow.type == 1, then that [SITENAME, SAMPLING.YEAR, cluster] group should be assigned a unique group ID. An example is below.
There is a similar question for SQL (Assigning group number based on condition), but I need a way to do this in R - sorry, I am a novice with if_else and loops. The code below is the best I could come up with, but it isn't working. Can anyone see what I'm doing wrong?
Thanks in advance for your help.
if (flow.type.test == "0") {
  event.samp.num.test <- "1000"
} else (flow.type.test == "1") {
  event.samp.num.test <- Sample_dat %>%
    group_by(SITENAME, SAMPLING.YEAR, cluster) %>%
    tally()
}
Note: the group ID '1000' is just an arbitrary, impossible value for this dataset - it will be used to subset the data later on.
My subset df looks like this:
> str(dummydat)
'data.frame': 68 obs. of 6 variables:
$ SITENAME : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ SAMPLING.YEAR: Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 4 ...
$ DATE : Date, format: "2017-10-17" "2017-10-17" "2017-10-22" "2017-11-28" ...
$ TIME : chr "10:45" "15:00" "15:20" "20:59" ...
$ flow.type : int 1 1 0 0 1 1 0 0 0 1 ...
$ cluster : int 1 1 2 3 4 4 5 6 7 8 ...
Sorry, I tried dput but the output is horrendous. I have included the first 40 rows of the subset data below as an example; I hope this is okay.
> head(dummydat, n=40)
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster
1 A 3 2017-10-17 10:45 1 1
2 A 3 2017-10-17 15:00 1 1
3 A 3 2017-10-22 15:20 0 2
4 A 3 2017-11-28 20:59 0 3
5 A 3 2017-12-05 18:15 1 4
6 A 3 2017-12-06 8:25 1 4
7 A 3 2017-12-10 10:05 0 5
8 A 3 2017-12-15 15:12 0 6
9 A 3 2017-12-19 17:40 0 7
10 A 4 2018-12-09 18:10 1 8
11 A 4 2018-12-16 10:35 0 9
12 A 4 2018-12-26 6:47 0 10
13 A 4 2019-01-01 14:25 0 11
14 A 4 2019-01-05 16:40 0 12
15 A 4 2019-01-12 7:42 0 13
16 A 4 2019-01-20 16:15 0 14
17 A 4 2019-01-28 10:41 0 15
18 A 4 2019-02-03 16:30 1 16
19 A 4 2019-02-04 17:14 1 16
20 B 1 2015-12-24 6:21 1 16
21 B 1 2015-12-29 17:41 1 17
22 B 1 2015-12-29 23:33 1 17
23 B 1 2015-12-30 5:17 1 17
24 B 1 2015-12-30 17:23 1 17
25 B 1 2015-12-31 5:29 1 17
26 B 1 2015-12-31 11:35 1 17
27 B 1 2015-12-31 23:40 1 17
28 B 1 2016-02-09 10:53 0 18
29 B 1 2016-03-03 15:23 1 19
30 B 1 2016-03-03 17:37 1 19
31 B 1 2016-03-03 21:33 1 19
32 B 1 2016-03-04 3:17 1 19
33 B 2 2017-01-07 13:16 1 20
34 B 2 2017-01-07 22:24 1 20
35 B 2 2017-01-08 6:34 1 20
36 B 2 2017-01-08 11:42 1 20
37 B 2 2017-01-08 20:50 1 20
38 B 2 2017-01-31 11:39 1 21
39 B 2 2017-01-31 16:45 1 21
40 B 2 2017-01-31 22:53 1 21

Here is one approach with the tidyverse:
library(dplyr)
library(tidyr)
left_join(df, df %>%
filter(flow.type == 1) %>%
group_by(SITENAME, SAMPLING.YEAR) %>%
mutate(group.ID = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)) %>%
mutate(group.ID = replace_na(group.ID, 1000))
First, filter the rows that have flow.type of 1. Then, group_by both SITENAME and SAMPLING.YEAR so that groups are numbered within each of those combinations. Next, use cumsum to take a cumulative count of the points where the cluster value changes - this becomes the group number. The result is merged back onto the original data with left_join; rows with flow.type of 0 get NA for group.ID, which replace_na then turns into 1000.
Output
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1 A 3 2017-10-17 10:45 1 1 1
2 A 3 2017-10-17 15:00 1 1 1
3 A 3 2017-10-22 15:20 0 2 1000
4 A 3 2017-11-28 20:59 0 3 1000
5 A 3 2017-12-05 18:15 1 4 2
6 A 3 2017-12-06 8:25 1 4 2
7 A 3 2017-12-10 10:05 0 5 1000
8 A 3 2017-12-15 15:12 0 6 1000
9 A 3 2017-12-19 17:40 0 7 1000
10 A 4 2018-12-09 18:10 1 8 1
11 A 4 2018-12-16 10:35 0 9 1000
12 A 4 2018-12-26 6:47 0 10 1000
13 A 4 2019-01-01 14:25 0 11 1000
14 A 4 2019-01-05 16:40 0 12 1000
15 A 4 2019-01-12 7:42 0 13 1000
16 A 4 2019-01-20 16:15 0 14 1000
17 A 4 2019-01-28 10:41 0 15 1000
18 A 4 2019-02-03 16:30 1 16 2
19 A 4 2019-02-04 17:14 1 16 2
20 B 1 2015-12-24 6:21 1 16 1
21 B 1 2015-12-29 17:41 1 17 2
22 B 1 2015-12-29 23:33 1 17 2
23 B 1 2015-12-30 5:17 1 17 2
24 B 1 2015-12-30 17:23 1 17 2
25 B 1 2015-12-31 5:29 1 17 2
26 B 1 2015-12-31 11:35 1 17 2
27 B 1 2015-12-31 23:40 1 17 2
28 B 1 2016-02-09 10:53 0 18 1000
29 B 1 2016-03-03 15:23 1 19 3
30 B 1 2016-03-03 17:37 1 19 3
31 B 1 2016-03-03 21:33 1 19 3
32 B 1 2016-03-04 3:17 1 19 3
33 B 2 2017-01-07 13:16 1 20 1
34 B 2 2017-01-07 22:24 1 20 1
35 B 2 2017-01-08 6:34 1 20 1
36 B 2 2017-01-08 11:42 1 20 1
37 B 2 2017-01-08 20:50 1 20 1
38 B 2 2017-01-31 11:39 1 21 2
39 B 2 2017-01-31 16:45 1 21 2
40 B 2 2017-01-31 22:53 1 21 2

Here is a data.table approach:
library(data.table)
setDT(df)[
, group.ID := 1000
][
flow.type == 1, group.ID := copy(.SD)[, grp := .GRP, by = cluster]$grp,
by = .(SITENAME, SAMPLING.YEAR)
]
Output
> df[]
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1: A 3 2017-10-17 10:45:00 1 1 1
2: A 3 2017-10-17 15:00:00 1 1 1
3: A 3 2017-10-22 15:20:00 0 2 1000
4: A 3 2017-11-28 20:59:00 0 3 1000
5: A 3 2017-12-05 18:15:00 1 4 2
6: A 3 2017-12-06 08:25:00 1 4 2
7: A 3 2017-12-10 10:05:00 0 5 1000
8: A 3 2017-12-15 15:12:00 0 6 1000
9: A 3 2017-12-19 17:40:00 0 7 1000
10: A 4 2018-12-09 18:10:00 1 8 1
11: A 4 2018-12-16 10:35:00 0 9 1000
12: A 4 2018-12-26 06:47:00 0 10 1000
13: A 4 2019-01-01 14:25:00 0 11 1000
14: A 4 2019-01-05 16:40:00 0 12 1000
15: A 4 2019-01-12 07:42:00 0 13 1000
16: A 4 2019-01-20 16:15:00 0 14 1000
17: A 4 2019-01-28 10:41:00 0 15 1000
18: A 4 2019-02-03 16:30:00 1 16 2
19: A 4 2019-02-04 17:14:00 1 16 2
20: B 1 2015-12-24 06:21:00 1 16 1
21: B 1 2015-12-29 17:41:00 1 17 2
22: B 1 2015-12-29 23:33:00 1 17 2
23: B 1 2015-12-30 05:17:00 1 17 2
24: B 1 2015-12-30 17:23:00 1 17 2
25: B 1 2015-12-31 05:29:00 1 17 2
26: B 1 2015-12-31 11:35:00 1 17 2
27: B 1 2015-12-31 23:40:00 1 17 2
28: B 1 2016-02-09 10:53:00 0 18 1000
29: B 1 2016-03-03 15:23:00 1 19 3
30: B 1 2016-03-03 17:37:00 1 19 3
31: B 1 2016-03-03 21:33:00 1 19 3
32: B 1 2016-03-04 03:17:00 1 19 3
33: B 2 2017-01-07 13:16:00 1 20 1
34: B 2 2017-01-07 22:24:00 1 20 1
35: B 2 2017-01-08 06:34:00 1 20 1
36: B 2 2017-01-08 11:42:00 1 20 1
37: B 2 2017-01-08 20:50:00 1 20 1
38: B 2 2017-01-31 11:39:00 1 21 2
39: B 2 2017-01-31 16:45:00 1 21 2
40: B 2 2017-01-31 22:53:00 1 21 2
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
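If you prefer to avoid the cumsum/lag bookkeeping, here is a sketch of a dplyr-only variant that should give the same numbering: match() against unique() numbers each distinct cluster in order of first appearance. The frame below (df, with a hypothetical obs column standing in for DATE/TIME to keep the join keys unique) is a toy stand-in for your data, not your actual dataset.

```r
library(dplyr)
library(tidyr)

# toy stand-in for the subset above; obs makes each row's join keys
# unique, the way DATE/TIME do in the real data
df <- data.frame(
  obs = 1:5,
  SITENAME = "A",
  SAMPLING.YEAR = 3,
  flow.type = c(1, 1, 0, 1, 1),
  cluster = c(1, 1, 2, 4, 4)
)

res <- df %>%
  left_join(
    df %>%
      filter(flow.type == 1) %>%
      group_by(SITENAME, SAMPLING.YEAR) %>%
      # number each distinct cluster in order of first appearance
      mutate(group.ID = match(cluster, unique(cluster))) %>%
      ungroup()
  ) %>%
  # rows with flow.type == 0 got no match, so group.ID is NA
  mutate(group.ID = replace_na(group.ID, 1000))

res$group.ID  # 1 1 1000 2 2
```

match(cluster, unique(cluster)) and the cumsum-of-changes trick agree whenever each cluster's rows are contiguous, as in your sample.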

Related

Sum in R based on a date range and another condition?

I am working on a dataframe of baseball data called mlb_team_logs. A random sample lies below.
Date Team season AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP SF SH GDP
1 2015-04-06 ARI 2015 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
2 2015-04-07 ARI 2015 31 36 8 4 1 1 2 7 7 5 0 7 0 0 0 1
3 2015-04-08 ARI 2015 32 35 5 3 2 0 0 2 1 2 0 7 1 0 0 0
4 2015-04-10 ARI 2015 35 38 7 6 0 0 1 4 4 3 0 10 0 0 0 0
5 2015-04-11 ARI 2015 32 35 10 9 0 0 1 6 6 3 0 7 0 0 0 1
6 2015-04-12 ARI 2015 36 38 10 7 3 0 0 4 4 1 0 11 0 0 1 1
7 2015-04-13 ARI 2015 39 44 12 8 3 1 0 8 7 4 0 11 0 0 1 0
8 2015-04-14 ARI 2015 28 32 3 1 2 0 0 1 1 3 0 4 1 0 0 2
9 2015-04-15 ARI 2015 33 34 9 7 1 0 1 2 2 1 0 8 0 0 0 1
10 2015-04-16 ARI 2015 47 51 11 6 2 0 3 7 7 3 1 8 1 0 0 0
240 2015-07-03 ATL 2015 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
241 2015-07-04 ATL 2015 34 40 10 6 3 0 1 9 9 5 0 5 0 0 1 0
242 2015-07-05 ATL 2015 35 37 7 6 1 0 0 0 0 1 0 10 1 0 0 1
243 2015-07-06 ATL 2015 40 44 15 10 4 0 1 5 5 3 0 7 0 0 1 1
244 2015-07-07 ATL 2015 34 37 10 7 1 1 1 4 4 2 0 4 0 0 1 1
245 2015-07-08 ATL 2015 31 38 7 4 1 0 2 5 5 5 1 7 0 0 2 1
246 2015-07-09 ATL 2015 34 37 10 8 2 0 0 3 3 1 0 9 0 1 1 2
247 2015-07-10 ATL 2015 32 35 8 7 0 0 1 3 3 2 0 5 1 0 0 2
248 2015-07-11 ATL 2015 33 38 6 3 1 0 2 2 2 5 1 8 0 0 0 0
249 2015-07-12 ATL 2015 34 41 8 6 2 0 0 3 3 7 1 10 0 0 0 1
250 2015-07-17 ATL 2015 30 36 7 4 3 0 0 4 4 5 1 7 0 0 0 0
In total, the df has 43 total columns. My objective is to sum columns 4 (AB) to 43 on two criteria:
the team
the date is within 7 days of the entry in "Date" (i.e. Date - 7 to Date - 1)
Eventually, I would like these columns to be appended to mlb_team_logs as l7_AB, l7_PA, etc. (but I know how to do that if the output will be a new dataframe). Any help is appreciated!
EDIT: I altered the sample to allow for more easily tested results.
You might be able to use a data.table non-equi join here. The idea would be to create a lower date bound (below, I've named this date_lb), and then join the table on itself, matching on Team = Team, Date < Date, and Date >= date_lb. Then use lapply with .SDcols to sum the columns of interest.
load library and set your frame to data.table
library(data.table)
setDT(mlb_team_logs)
Identify the columns you want to sum, in a character vector (change to 4:43 in your full dataset)
sum_cols <- names(mlb_team_logs)[4:19]
Add a lower bound on the date
mlb_team_logs[, date_lb := Date - 7]
Join the table on itself, and use lapply(.SD, sum) on the columns of interest
result <- mlb_team_logs[
  mlb_team_logs[, .(Team, Date, date_lb)],
  on = .(Team, Date < Date, Date >= date_lb)
][, lapply(.SD, sum), by = .(Date, Team), .SDcols = sum_cols]
Set the new names (in place, using setnames())
setnames(result, old = sum_cols, new = paste0("I7_", sum_cols))
Output:
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
<IDat> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 2015-04-06 ARI NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2015-04-07 ARI 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
3: 2015-04-08 ARI 65 75 17 11 2 2 2 11 11 8 0 13 2 0 0 3
4: 2015-04-10 ARI 97 110 22 14 4 2 2 13 12 10 0 20 3 0 0 3
5: 2015-04-11 ARI 132 148 29 20 4 2 3 17 16 13 0 30 3 0 0 3
6: 2015-04-12 ARI 164 183 39 29 4 2 4 23 22 16 0 37 3 0 0 4
7: 2015-04-13 ARI 200 221 49 36 7 2 4 27 26 17 0 48 3 0 1 5
8: 2015-04-14 ARI 205 226 52 37 9 2 4 31 29 18 0 53 1 0 2 3
9: 2015-04-15 ARI 202 222 47 34 10 1 2 25 23 16 0 50 2 0 2 4
10: 2015-04-16 ARI 203 221 51 38 9 1 3 25 24 15 0 51 1 0 2 5
11: 2015-07-03 ATL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 2015-07-04 ATL 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
13: 2015-07-05 ATL 64 72 17 10 4 0 3 11 11 7 0 11 0 0 1 1
14: 2015-07-06 ATL 99 109 24 16 5 0 3 11 11 8 0 21 1 0 1 2
15: 2015-07-07 ATL 139 153 39 26 9 0 4 16 16 11 0 28 1 0 2 3
16: 2015-07-08 ATL 173 190 49 33 10 1 5 20 20 13 0 32 1 0 3 4
17: 2015-07-09 ATL 204 228 56 37 11 1 7 25 25 18 1 39 1 0 5 5
18: 2015-07-10 ATL 238 265 66 45 13 1 7 28 28 19 1 48 1 1 6 7
19: 2015-07-11 ATL 240 268 67 48 12 1 6 29 29 19 1 47 2 1 6 8
20: 2015-07-12 ATL 239 266 63 45 10 1 7 22 22 19 2 50 2 1 5 8
21: 2015-07-17 ATL 99 114 22 16 3 0 3 8 8 14 2 23 1 0 0 3
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
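If the data.table syntax feels heavy, the same trailing-window sum can be sketched in dplyr alone: for each row, sum the team's rows whose Date falls in [Date - 7, Date - 1]. This is a quadratic per-team lookup, so it is only a sketch for modest data sizes, and note it yields 0 rather than NA when the window is empty. The frame logs and the two-column sum_cols below are hypothetical stand-ins for mlb_team_logs and its columns 4:43.

```r
library(dplyr)

# toy stand-in: one team, three consecutive game dates
logs <- data.frame(
  Date = as.Date(c("2015-04-06", "2015-04-07", "2015-04-08")),
  Team = "ARI",
  AB = c(34, 31, 32),
  PA = c(39, 36, 35)
)
sum_cols <- c("AB", "PA")

res <- logs %>%
  group_by(Team) %>%
  mutate(across(all_of(sum_cols),
                # sum this column over the group's rows dated Date-7 .. Date-1
                ~ sapply(Date, function(d) sum(.x[Date >= d - 7 & Date <= d - 1])),
                .names = "l7_{.col}")) %>%
  ungroup()

res$l7_AB  # 0 34 65
```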

Sum up ending with the current observation starting based on a criteria

I observe the number of purchases of (in the example below: 4) different customers on (five) different days. Now I want to create a new variable summing up each user's purchases within the span of the last 20 purchases made in total, across all users.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable:
(1) What is the overall number of purchases on a day across users (day_purchases)?
(2) What is the cumulative number of purchases across users, starting from the first day (cumsum_day_purchases)?
(3) Looking back from the current observation, on which day did the 20 immediately preceding purchases (across users) begin? This is the part I am struggling to code.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
I will now compute the desired variable by hand in the following dataset.
For all observations on day 2016-04-12, I compute the cumulative sum
of purchases of a specific customer by adding the number of purchases
of the current day and the preceding day, because in total all
customers together made 20 purchases on the current day and the
preceding day.
For day 2016-04-13, I only use the number of purchases of a user on
this day, because there were 21 (41 - 20) new purchases on that day alone.
Resulting in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping that starts over each time the day_purchases column reaches 20 or more, and then use cumsum within it:
library(dplyr)
da %>%
  group_by(day) %>%
  mutate(day_purchases = sum(n_purchase)) %>%
  group_by(customer_id) %>%
  mutate(above = with(rle(day_purchases >= 20), rep(seq_along(lengths), lengths))) %>%
  group_by(above, .add = TRUE) %>%
  mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)

Grouping by changes in value while maintaining dates in R

I'm trying to group my data by subject_lab and changes to subject_value while maintaining dates of changes for each subject_value per subject_lab per subject_ID.
I've looked into dplyr and data.table examples scattered throughout stackoverflow, but I haven't found anything that works for my problem.
subject_id <- rep(1, each=10)
subject_date <- as.Date("2019-01-01"):(as.Date("2019-01-01")+29)
subject_date <- as.Date(subject_date, origin="1970-01-01")
subject_lab <- rep(1:3, each=10)
set.seed(123)
subject_value <- sample(0:4, size=30, replace=T)
subject_sample_df <- data.frame(subject_id, subject_date, subject_lab,
subject_value)
subject_id subject_date subject_lab subject_value
1 1 2019-01-01 1 1
2 1 2019-01-02 1 3
3 1 2019-01-03 1 2
4 1 2019-01-04 1 4
5 1 2019-01-05 1 4
6 1 2019-01-06 1 0
7 1 2019-01-07 1 2
8 1 2019-01-08 1 4
9 1 2019-01-09 1 2
10 1 2019-01-10 1 2
11 1 2019-01-11 2 4
12 1 2019-01-12 2 2
13 1 2019-01-13 2 3
14 1 2019-01-14 2 2
15 1 2019-01-15 2 0
16 1 2019-01-16 2 4
17 1 2019-01-17 2 1
18 1 2019-01-18 2 0
19 1 2019-01-19 2 1
20 1 2019-01-20 2 4
21 1 2019-01-21 3 4
22 1 2019-01-22 3 3
23 1 2019-01-23 3 3
24 1 2019-01-24 3 4
25 1 2019-01-25 3 3
26 1 2019-01-26 3 3
27 1 2019-01-27 3 2
28 1 2019-01-28 3 2
29 1 2019-01-29 3 1
30 1 2019-01-30 3 0
The expected results are below. Rows 4, 8, 20, 22, and 23 now have merged time frames.
id start_date stop_date lab value
1 1 2019-01-01 2019-01-01 1 1
2 1 2019-01-02 2019-01-02 1 3
3 1 2019-01-03 2019-01-03 1 2
4 1 2019-01-04 2019-01-05 1 4
5 1 2019-01-06 2019-01-06 1 0
6 1 2019-01-07 2019-01-07 1 2
7 1 2019-01-08 2019-01-08 1 4
8 1 2019-01-09 2019-01-10 1 2
9 1 2019-01-11 2019-01-11 2 4
10 1 2019-01-12 2019-01-12 2 2
11 1 2019-01-13 2019-01-13 2 3
12 1 2019-01-14 2019-01-14 2 2
13 1 2019-01-15 2019-01-15 2 0
14 1 2019-01-16 2019-01-16 2 4
15 1 2019-01-17 2019-01-17 2 1
16 1 2019-01-18 2019-01-18 2 0
17 1 2019-01-19 2019-01-19 2 1
18 1 2019-01-20 2019-01-20 2 4
19 1 2019-01-21 2019-01-21 3 4
20 1 2019-01-22 2019-01-23 3 3
21 1 2019-01-24 2019-01-24 3 4
22 1 2019-01-25 2019-01-26 3 3
23 1 2019-01-27 2019-01-28 3 2
24 1 2019-01-29 2019-01-29 3 1
25 1 2019-01-30 2019-01-30 3 0
You could use data.table's rleid to help you with that:
library(dplyr)
library(data.table)
df %>%
rename_with(~ gsub(".*_", "", .x)) %>%
group_by(grp = rleid(id, lab, value)) %>%
mutate(start_date = min(date),
stop_date = max(date)) %>%
ungroup() %>%
distinct(id, start_date, stop_date, lab, value)
Output:
id lab value start_date stop_date
1 1 1 1 2019-01-01 2019-01-01
2 1 1 3 2019-01-02 2019-01-02
3 1 1 2 2019-01-03 2019-01-03
4 1 1 4 2019-01-04 2019-01-05
5 1 1 0 2019-01-06 2019-01-06
6 1 1 2 2019-01-07 2019-01-07
7 1 1 4 2019-01-08 2019-01-08
8 1 1 2 2019-01-09 2019-01-10
9 1 2 4 2019-01-11 2019-01-11
10 1 2 2 2019-01-12 2019-01-12
11 1 2 3 2019-01-13 2019-01-13
12 1 2 2 2019-01-14 2019-01-14
13 1 2 0 2019-01-15 2019-01-15
14 1 2 4 2019-01-16 2019-01-16
15 1 2 1 2019-01-17 2019-01-17
16 1 2 0 2019-01-18 2019-01-18
17 1 2 1 2019-01-19 2019-01-19
18 1 2 4 2019-01-20 2019-01-20
19 1 3 4 2019-01-21 2019-01-21
20 1 3 3 2019-01-22 2019-01-23
21 1 3 4 2019-01-24 2019-01-24
22 1 3 3 2019-01-25 2019-01-26
23 1 3 2 2019-01-27 2019-01-28
24 1 3 1 2019-01-29 2019-01-29
25 1 3 0 2019-01-30 2019-01-30
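If you'd rather stay entirely in dplyr, recent versions (1.1.0+) ship consecutive_id(), which plays the same role as rleid. A sketch on a small hypothetical stand-in frame (labs), with column names following the question's subject_* convention:

```r
library(dplyr)

# small stand-in: one subject, one lab, runs of repeated values
labs <- data.frame(
  subject_id = 1,
  subject_date = as.Date("2019-01-01") + 0:5,
  subject_lab = 1,
  subject_value = c(1, 1, 2, 2, 2, 3)
)

res <- labs %>%
  group_by(subject_id, subject_lab,
           # new group at each change in subject_value
           grp = consecutive_id(subject_value)) %>%
  summarise(start_date = min(subject_date),
            stop_date = max(subject_date),
            value = first(subject_value),
            .groups = "drop")
```

Each run of repeated values collapses to one row carrying its first and last date, which is exactly the merged-time-frame shape of the expected output.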

R dplyr conditional sum with dynamic conditions

I need to calculate column "sum_other_users_30d" from the following dataset (example):
id user count date_start date_current sum_other_users_30d
1 1 3 2015-01-01 2015-01-07 16
1 1 2 2015-01-01 2015-01-10 16
1 1 5 2015-01-01 2015-01-20 16
1 1 1 2015-01-01 2015-02-22 16
1 2 1 2015-02-02 2015-01-15 3
1 2 1 2015-02-02 2015-01-10 3
1 2 6 2015-02-02 2015-01-30 3
1 2 2 2015-02-02 2015-02-22 3
1 3 1 2015-01-16 2015-01-17 14
1 3 1 2015-01-16 2015-01-31 14
1 3 6 2015-01-16 2015-01-30 14
1 3 2 2015-01-16 2015-02-22 14
The value of sum_other_users_30d for each observation is the sum of count over the rows of other users (user != the current observation's user) whose date_current falls within 30 days of the current observation's date_start (date_current - 30 <= the current observation's date_start).
For example, in first line the sum of 16 is made up of following count values:
id user count date_start date_current sum_other_users_30d
1 2 1 2015-02-02 2015-01-15 3
1 2 1 2015-02-02 2015-01-10 3
1 2 6 2015-02-02 2015-01-30 3
1 3 1 2015-01-16 2015-01-17 14
1 3 1 2015-01-16 2015-01-31 14
1 3 6 2015-01-16 2015-01-30 14
I'm trying to do this in dplyr with mutate(), but I can't find a way to make the sum's conditions refer to the current observation's values (user not equal to the current user, etc.).
I'd be grateful for your help!
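No answer was recorded for this one, but the rule as stated (other users only, date_current - 30 <= the current row's date_start) can be sketched with a per-row lookup inside mutate. This is an assumption-laden sketch, not a verified answer: it reproduces the 16 for user 1, but matching the other users' values in the example may require an extra lower bound on the window, which the question leaves ambiguous. The data frame da is reconstructed from the sample above.

```r
library(dplyr)

# data reconstructed from the question's sample
da <- data.frame(
  id = 1,
  user = rep(1:3, each = 4),
  count = c(3, 2, 5, 1, 1, 1, 6, 2, 1, 1, 6, 2),
  date_start = as.Date(rep(c("2015-01-01", "2015-02-02", "2015-01-16"), each = 4)),
  date_current = as.Date(c("2015-01-07", "2015-01-10", "2015-01-20", "2015-02-22",
                           "2015-01-15", "2015-01-10", "2015-01-30", "2015-02-22",
                           "2015-01-17", "2015-01-31", "2015-01-30", "2015-02-22"))
)

res <- da %>%
  mutate(sum_other_users_30d = sapply(seq_len(n()), function(i) {
    # other users whose date_current is no more than 30 days
    # after this row's date_start
    sum(count[user != user[i] & date_current - 30 <= date_start[i]])
  }))
```

For larger data, the same row-referencing logic is usually expressed as a non-equi self-join rather than a per-row sapply.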

Adding missing rows in R

I have a table in R as follows
month day total
1 1 3 1414
2 1 5 1070
3 1 6 211
4 1 7 2214
5 1 8 1766
6 1 13 2486
7 1 14 43
8 1 15 2349
9 1 16 4616
10 1 17 2432
11 1 18 1482
12 1 19 694
13 1 20 968
14 1 23 381
15 1 24 390
16 1 26 4063
17 1 27 3323
18 1 28 988
19 1 29 9671
20 1 30 11968
I have to add the missing days, such as 1, 2, and 4, with a total of zero, so that the result looks like this:
month day total
- 1 1 0
- 1 2 0
1 3 1414
1 4 0
2 1 5 1070
3 1 6 211
4 1 7 2214
5 1 8 1766
1 9 0
1 10 0
1 11 0
1 12 0
6 1 13 2486
7 1 14 43
8 1 15 2349
9 1 16 4616
10 1 17 2432
11 1 18 1482
12 1 19 694
13 1 20 968
1 21 0
1 22 0
14 1 23 381
15 1 24 390
1 25 0
16 1 26 4063
17 1 27 3323
18 1 28 988
19 1 29 9671
20 1 30 11968
Using only base R, you could do it this way:
for(d in 1:31) {
if(!d %in% my.df$day)
my.df[nrow(my.df) + 1,] <- c(1,d,0)
}
# Reorder rows
my.df <- my.df[with(my.df, order(month, day)),]
rownames(my.df) <- NULL
# Check the results
head(my.df)
# month day total
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1414
# 4 1 4 0
# 5 1 5 1070
# 6 1 6 211
We could create a new dataset with a 'day' column running 1:30 and 'month' set to 1, left_join it with the original dataset, and replace the NA values left by the merge with 0:
df2 <- data.frame(month=1, day=1:30)
library(dplyr)
left_join(df2, df1) %>%
mutate(total=replace(total, which(is.na(total)),0))
Or use merge from base R to get 'dM' and assign the NA values in 'total' to 0:
dM <- merge(df1, df2, all.y=TRUE)
dM$total[is.na(dM$total)] <- 0
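tidyr's complete() wraps the same expand-join-fill pattern in a single call; a sketch assuming the original frame is df1 with columns month, day, and total (the three-row df1 below is a stand-in for the real table):

```r
library(tidyr)

# small stand-in for df1
df1 <- data.frame(month = 1, day = c(3, 5, 6), total = c(1414, 1070, 211))

# expand day to the full 1:30 range, filling missing totals with 0
filled <- complete(df1, month, day = 1:30, fill = list(total = 0))
```

The fill argument only touches rows created by the expansion, so existing totals are left untouched.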
