I need to calculate column "sum_other_users_30d" from the following dataset (example):
id user count date_start date_current **sum_other_users_30d**
1 1 3 2015-01-01 2015-01-07 16
1 1 2 2015-01-01 2015-01-10 16
1 1 5 2015-01-01 2015-01-20 16
1 1 1 2015-01-01 2015-02-22 16
1 2 1 2015-02-02 2015-01-15 3
1 2 1 2015-02-02 2015-01-10 3
1 2 6 2015-02-02 2015-01-30 3
1 2 2 2015-02-02 2015-02-22 3
1 3 1 2015-01-16 2015-01-17 14
1 3 1 2015-01-16 2015-01-31 14
1 3 6 2015-01-16 2015-01-30 14
1 3 2 2015-01-16 2015-02-22 14
The value of sum_other_users_30d for each observation is the sum of count over the observations of other users (user != the current observation's user) whose date_current is within 30 days of the current observation's date_start (date_current - 30 <= the current observation's date_start).
For example, in the first line the sum of 16 is made up of the following count values:
id user count date_start date_current sum_other_users_30d
1 2 1 2015-02-02 2015-01-15 3
1 2 1 2015-02-02 2015-01-10 3
1 2 6 2015-02-02 2015-01-30 3
1 3 1 2015-01-16 2015-01-17 14
1 3 1 2015-01-16 2015-01-31 14
1 3 6 2015-01-16 2015-01-30 14
I'm trying to do this in dplyr with mutate(), but I can't find a way to make the sum conditions reference the values of the current observation (user not equal to the current user, etc.).
I'd be grateful for your help!
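As a sketch, one way around the referencing problem is to iterate over row indices with sapply() inside mutate(), so each row's user and date_start can be compared against the whole columns. This implements the rule exactly as stated above (other users only, and date_current - 30 <= the current row's date_start); the first four rows then reproduce the 16 from the example.

```r
library(dplyr)

# example data as shown in the question
df <- data.frame(
  id = 1,
  user = rep(1:3, each = 4),
  count = c(3, 2, 5, 1, 1, 1, 6, 2, 1, 1, 6, 2),
  date_start = as.Date(rep(c("2015-01-01", "2015-02-02", "2015-01-16"), each = 4)),
  date_current = as.Date(c("2015-01-07", "2015-01-10", "2015-01-20", "2015-02-22",
                           "2015-01-15", "2015-01-10", "2015-01-30", "2015-02-22",
                           "2015-01-17", "2015-01-31", "2015-01-30", "2015-02-22"))
)

df <- df %>%
  mutate(sum_other_users_30d = sapply(seq_len(n()), function(i) {
    # sum count over other users whose date_current falls within
    # 30 days of this row's date_start
    sum(count[user != user[i] & date_current - 30 <= date_start[i]])
  }))
```

For 40k+ rows a non-equi join would scale better, but this row-indexing sketch keeps the condition readable.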
I want to keep an observation (grouped by ID) for every 30 days. I want to do this by creating a variable that tells me which observations are left inside (1) and which ones are outside (0) of the filter.
Example
id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021
Expected result
id date count
1 3/1/2021 1
1 4/1/2021 0
1 5/1/2021 0
1 6/1/2021 0
1 2/2/2021 0
1 3/2/2021 1
1 5/2/2021 0
1 7/2/2021 0
1 9/2/2021 0
1 11/2/2021 0
1 13/2/2021 0
1 16/3/2021 1
2 5/1/2021 1
2 31/10/2021 1
2 9/1/2021 0
2 6/2/2021 1
2 1/6/2021 1
3 1/1/2021 1
3 1/6/2021 1
3 31/12/2021 1
4 5/5/2021 1
Here is a data.table approach:
library(data.table)
# sort by id, then by date
setkey(DT, id, date)
# create groups
DT[, group := rleid((as.numeric(date - date[1])) %/% 30), by = .(id)][]
# create count column
DT[, count := as.integer(group != shift(group, type = "lag", fill = 0)), by = .(id)][]
# id date group count
# 1: 1 2021-01-03 1 1
# 2: 1 2021-01-04 1 0
# 3: 1 2021-01-05 1 0
# 4: 1 2021-01-06 1 0
# 5: 1 2021-02-02 2 1
# 6: 1 2021-02-03 2 0
# 7: 1 2021-02-05 2 0
# 8: 1 2021-02-07 2 0
# 9: 1 2021-02-09 2 0
#10: 1 2021-02-11 2 0
#11: 1 2021-02-13 2 0
#12: 1 2021-03-16 3 1
#13: 2 2021-01-05 1 1
#14: 2 2021-01-09 1 0
#15: 2 2021-02-06 2 1
#16: 2 2021-06-01 3 1
#17: 2 2021-10-31 4 1
#18: 3 2021-01-01 1 1
#19: 3 2021-06-01 2 1
#20: 3 2021-12-31 3 1
#21: 4 2021-05-05 1 1
# id date group count
Sample data used:
DT <- fread("id date
1 3/1/2021
1 4/1/2021
1 5/1/2021
1 6/1/2021
1 2/2/2021
1 3/2/2021
1 5/2/2021
1 7/2/2021
1 9/2/2021
1 11/2/2021
1 13/2/2021
1 16/3/2021
2 5/1/2021
2 31/10/2021
2 9/1/2021
2 6/2/2021
2 1/6/2021
3 1/1/2021
3 1/6/2021
3 31/12/2021
4 5/5/2021")
# convert date from character to Date class
DT[, date := as.Date(date, "%d/%m/%Y")]
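For comparison, the same fixed-window logic (30-day blocks counted from each id's first date, not a rolling window) can be sketched in dplyr; the data frame below is a small subset of the sample data:

```r
library(dplyr)

# subset of the sample data, dates already parsed
df <- data.frame(
  id = c(1, 1, 1, 1, 2, 2),
  date = as.Date(c("2021-01-03", "2021-01-04", "2021-02-02",
                   "2021-03-16", "2021-01-05", "2021-02-06"))
)

df <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(
    # 30-day blocks counted from each id's first date
    group = as.numeric(date - first(date)) %/% 30,
    # first row of each block gets a 1
    count = as.integer(group != lag(group, default = -1))
  ) %>%
  ungroup()
```

As with the data.table version, the first row of every new 30-day block is flagged with 1.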
I need some help with my R code; I've been trying to get it to work for ages and I'm totally stuck.
I have a large dataset (~40000 rows) and I need to assign group IDs to a new column based on a condition on another column. So if df$flow.type == 1, then that [SITENAME, SAMPLING.YEAR, cluster] group should be assigned a unique group ID. This is an example:
This is a similar question but for SQL: Assigning group number based on condition. I need a way to do this in R; sorry, I am a novice with if_else and loops. The code below is the best I could come up with, but it isn't working. Can anyone see what I'm doing wrong?
Thanks in advance for your help.
if(flow.type.test=="0"){
event.samp.num.test <- "1000"
} else (flow.type.test=="1"){
event.samp.num.test <- Sample_dat %>% group_by(SITENAME, SAMPLING.YEAR, cluster) %>% tally()}
Note the group ID '1000' is just a random impossible number for this dataset - it will be used to subset the data later on.
My subset df looks like this:
> str(dummydat)
'data.frame': 68 obs. of 6 variables:
$ SITENAME : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ SAMPLING.YEAR: Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 4 ...
$ DATE : Date, format: "2017-10-17" "2017-10-17" "2017-10-22" "2017-11-28" ...
$ TIME : chr "10:45" "15:00" "15:20" "20:59" ...
$ flow.type : int 1 1 0 0 1 1 0 0 0 1 ...
$ cluster : int 1 1 2 3 4 4 5 6 7 8 ...
Sorry, I tried dput but the output is horrendous. I have subset 40 rows of the data below as an example; I hope this is okay.
> head(dummydat, n=40)
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster
1 A 3 2017-10-17 10:45 1 1
2 A 3 2017-10-17 15:00 1 1
3 A 3 2017-10-22 15:20 0 2
4 A 3 2017-11-28 20:59 0 3
5 A 3 2017-12-05 18:15 1 4
6 A 3 2017-12-06 8:25 1 4
7 A 3 2017-12-10 10:05 0 5
8 A 3 2017-12-15 15:12 0 6
9 A 3 2017-12-19 17:40 0 7
10 A 4 2018-12-09 18:10 1 8
11 A 4 2018-12-16 10:35 0 9
12 A 4 2018-12-26 6:47 0 10
13 A 4 2019-01-01 14:25 0 11
14 A 4 2019-01-05 16:40 0 12
15 A 4 2019-01-12 7:42 0 13
16 A 4 2019-01-20 16:15 0 14
17 A 4 2019-01-28 10:41 0 15
18 A 4 2019-02-03 16:30 1 16
19 A 4 2019-02-04 17:14 1 16
20 B 1 2015-12-24 6:21 1 16
21 B 1 2015-12-29 17:41 1 17
22 B 1 2015-12-29 23:33 1 17
23 B 1 2015-12-30 5:17 1 17
24 B 1 2015-12-30 17:23 1 17
25 B 1 2015-12-31 5:29 1 17
26 B 1 2015-12-31 11:35 1 17
27 B 1 2015-12-31 23:40 1 17
28 B 1 2016-02-09 10:53 0 18
29 B 1 2016-03-03 15:23 1 19
30 B 1 2016-03-03 17:37 1 19
31 B 1 2016-03-03 21:33 1 19
32 B 1 2016-03-04 3:17 1 19
33 B 2 2017-01-07 13:16 1 20
34 B 2 2017-01-07 22:24 1 20
35 B 2 2017-01-08 6:34 1 20
36 B 2 2017-01-08 11:42 1 20
37 B 2 2017-01-08 20:50 1 20
38 B 2 2017-01-31 11:39 1 21
39 B 2 2017-01-31 16:45 1 21
40 B 2 2017-01-31 22:53 1 21
Here is one approach with tidyverse:
library(dplyr)
library(tidyr)
left_join(df, df %>%
filter(flow.type == 1) %>%
group_by(SITENAME, SAMPLING.YEAR) %>%
mutate(group.ID = cumsum(cluster != lag(cluster, default = first(cluster))) + 1)) %>%
mutate(group.ID = replace_na(group.ID, 1000))
First, filter the rows that have flow.type of 1. Then, group_by both SITENAME and SAMPLING.YEAR so that groups are counted within those characteristics. Next, take the cumsum of whether the cluster value changes; this cumulative sum becomes the group number. The result is merged back with the original data (left_join). Finally, replace_na turns the NA group.ID of the flow.type 0 rows into 1000.
Output
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1 A 3 2017-10-17 10:45 1 1 1
2 A 3 2017-10-17 15:00 1 1 1
3 A 3 2017-10-22 15:20 0 2 1000
4 A 3 2017-11-28 20:59 0 3 1000
5 A 3 2017-12-05 18:15 1 4 2
6 A 3 2017-12-06 8:25 1 4 2
7 A 3 2017-12-10 10:05 0 5 1000
8 A 3 2017-12-15 15:12 0 6 1000
9 A 3 2017-12-19 17:40 0 7 1000
10 A 4 2018-12-09 18:10 1 8 1
11 A 4 2018-12-16 10:35 0 9 1000
12 A 4 2018-12-26 6:47 0 10 1000
13 A 4 2019-01-01 14:25 0 11 1000
14 A 4 2019-01-05 16:40 0 12 1000
15 A 4 2019-01-12 7:42 0 13 1000
16 A 4 2019-01-20 16:15 0 14 1000
17 A 4 2019-01-28 10:41 0 15 1000
18 A 4 2019-02-03 16:30 1 16 2
19 A 4 2019-02-04 17:14 1 16 2
20 B 1 2015-12-24 6:21 1 16 1
21 B 1 2015-12-29 17:41 1 17 2
22 B 1 2015-12-29 23:33 1 17 2
23 B 1 2015-12-30 5:17 1 17 2
24 B 1 2015-12-30 17:23 1 17 2
25 B 1 2015-12-31 5:29 1 17 2
26 B 1 2015-12-31 11:35 1 17 2
27 B 1 2015-12-31 23:40 1 17 2
28 B 1 2016-02-09 10:53 0 18 1000
29 B 1 2016-03-03 15:23 1 19 3
30 B 1 2016-03-03 17:37 1 19 3
31 B 1 2016-03-03 21:33 1 19 3
32 B 1 2016-03-04 3:17 1 19 3
33 B 2 2017-01-07 13:16 1 20 1
34 B 2 2017-01-07 22:24 1 20 1
35 B 2 2017-01-08 6:34 1 20 1
36 B 2 2017-01-08 11:42 1 20 1
37 B 2 2017-01-08 20:50 1 20 1
38 B 2 2017-01-31 11:39 1 21 2
39 B 2 2017-01-31 16:45 1 21 2
40 B 2 2017-01-31 22:53 1 21 2
Here is a data.table approach:
library(data.table)
setDT(df)[
, group.ID := 1000
][
flow.type == 1, group.ID := copy(.SD)[, grp := .GRP, by = cluster]$grp,
by = .(SITENAME, SAMPLING.YEAR)
]
Output
> df[]
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
1: A 3 2017-10-17 10:45:00 1 1 1
2: A 3 2017-10-17 15:00:00 1 1 1
3: A 3 2017-10-22 15:20:00 0 2 1000
4: A 3 2017-11-28 20:59:00 0 3 1000
5: A 3 2017-12-05 18:15:00 1 4 2
6: A 3 2017-12-06 08:25:00 1 4 2
7: A 3 2017-12-10 10:05:00 0 5 1000
8: A 3 2017-12-15 15:12:00 0 6 1000
9: A 3 2017-12-19 17:40:00 0 7 1000
10: A 4 2018-12-09 18:10:00 1 8 1
11: A 4 2018-12-16 10:35:00 0 9 1000
12: A 4 2018-12-26 06:47:00 0 10 1000
13: A 4 2019-01-01 14:25:00 0 11 1000
14: A 4 2019-01-05 16:40:00 0 12 1000
15: A 4 2019-01-12 07:42:00 0 13 1000
16: A 4 2019-01-20 16:15:00 0 14 1000
17: A 4 2019-01-28 10:41:00 0 15 1000
18: A 4 2019-02-03 16:30:00 1 16 2
19: A 4 2019-02-04 17:14:00 1 16 2
20: B 1 2015-12-24 06:21:00 1 16 1
21: B 1 2015-12-29 17:41:00 1 17 2
22: B 1 2015-12-29 23:33:00 1 17 2
23: B 1 2015-12-30 05:17:00 1 17 2
24: B 1 2015-12-30 17:23:00 1 17 2
25: B 1 2015-12-31 05:29:00 1 17 2
26: B 1 2015-12-31 11:35:00 1 17 2
27: B 1 2015-12-31 23:40:00 1 17 2
28: B 1 2016-02-09 10:53:00 0 18 1000
29: B 1 2016-03-03 15:23:00 1 19 3
30: B 1 2016-03-03 17:37:00 1 19 3
31: B 1 2016-03-03 21:33:00 1 19 3
32: B 1 2016-03-04 03:17:00 1 19 3
33: B 2 2017-01-07 13:16:00 1 20 1
34: B 2 2017-01-07 22:24:00 1 20 1
35: B 2 2017-01-08 06:34:00 1 20 1
36: B 2 2017-01-08 11:42:00 1 20 1
37: B 2 2017-01-08 20:50:00 1 20 1
38: B 2 2017-01-31 11:39:00 1 21 2
39: B 2 2017-01-31 16:45:00 1 21 2
40: B 2 2017-01-31 22:53:00 1 21 2
SITENAME SAMPLING.YEAR DATE TIME flow.type cluster group.ID
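The copy(.SD) idiom above is one way to pull a per-subgroup .GRP back into the original table; an equivalent, arguably simpler sketch numbers the clusters with match() against their order of first appearance (shown on a toy subset of the data):

```r
library(data.table)

# toy subset: first nine rows of site A, year 3
df <- data.table(
  SITENAME = "A",
  SAMPLING.YEAR = 3,
  flow.type = c(1, 1, 0, 0, 1, 1, 0),
  cluster = c(1, 1, 2, 3, 4, 4, 5)
)

df[, group.ID := 1000]  # default for flow.type == 0 rows
# dense group number in order of first appearance, per site/year
df[flow.type == 1, group.ID := match(cluster, unique(cluster)),
   by = .(SITENAME, SAMPLING.YEAR)]
```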
I'm trying to create a variable that shows the number of days since a particular event occurred. This is a follow up to this previous question, using the same data.
The data looks like this (note dates are in DD-MM-YYYY format):
ID date drug score
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4
C 6/12/2016 2 4
C 8/12/2016 1 2
C 9/12/2016 1 2
For 'drug': 1=drug taken, 2=no drug taken.
Each time the value of drug is 1, if that ID has a previous record that is also drug==1, then I need to generate a new value 'lagtime' that shows the number of days (not the number of rows!) since the previous time the drug was taken.
So the output I am looking for is:
ID date drug score lagtime
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4 5
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4 3
C 6/12/2016 2 4
C 8/12/2016 1 2 3
C 9/12/2016 1 2 1
So I need a way to generate (mutate?) this lagtime score that is calculated as the date for each drug==1 record, minus the date of the previous drug==1 record, grouped by ID.
This has me completely bamboozled.
Here's code for the example data:
data<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)). Grouped by 'ID', specify drug == 1 in i and take the difference of 'date' (diff(date)). Concatenate an NA at the front, because the diff output is one element shorter than the original vector, convert to integer, and assign (:=) to create 'lagtime'. By default, all other rows get NA.
library(data.table)
setDT(data)[drug==1, lagtime := as.integer(c(NA, diff(date))), ID]
data
# ID date drug score lagtime
# 1: A 2016-08-28 2 3 NA
# 2: A 2016-08-29 1 4 NA
# 3: A 2016-08-30 2 4 NA
# 4: A 2016-09-02 2 4 NA
# 5: A 2016-09-03 1 4 5
# 6: A 2016-09-04 2 4 NA
# 7: B 2016-08-08 1 3 NA
# 8: B 2016-08-09 2 4 NA
# 9: B 2016-08-10 2 3 NA
#10: B 2016-08-11 1 3 3
#11: C 2016-11-30 2 4 NA
#12: C 2016-12-02 1 5 NA
#13: C 2016-12-03 2 1 NA
#14: C 2016-12-05 1 4 3
#15: C 2016-12-06 2 4 NA
#16: C 2016-12-08 1 2 3
#17: C 2016-12-09 1 2 1
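A dplyr sketch of the same idea (diff over the drug == 1 dates within each ID, written back only onto those rows) could look like this, shown on a subset of the example data:

```r
library(dplyr)

# subset of the example data
data <- data.frame(
  ID = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2016-08-28", "2016-08-29", "2016-09-03",
                   "2016-08-08", "2016-08-10", "2016-08-11")),
  drug = c(2, 1, 1, 1, 2, 1)
)

data <- data %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  # day gaps between consecutive drug == 1 dates, placed on those rows;
  # all other rows stay NA
  mutate(lagtime = replace(rep(NA_integer_, n()), drug == 1,
                           as.integer(c(NA, diff(date[drug == 1]))))) %>%
  ungroup()
```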
I have a dataset as below:
The outcome has no relationship with contact_date. When a subscriber responds to a cold call, we mark it as a successful contact attempt (1), otherwise (0). The count is how many times we have called the subscriber.
subscriber_id outcome contact_date queue multiple_number count
(int) (int) (date) (fctr) (int) (int)
1 1 1 2015-01-29 2 1 1
2 1 0 2015-02-21 2 1 2
3 1 0 2015-03-29 2 1 3
4 1 1 2015-04-30 2 1 4
5 2 0 2015-01-29 2 1 1
6 2 0 2015-02-21 2 1 2
7 2 0 2015-03-29 2 1 3
8 2 0 2015-04-30 2 1 4
9 2 1 2015-05-31 2 1 5
10 2 1 2015-08-25 5 1 6
11 2 0 2015-10-30 5 1 7
12 2 0 2015-12-14 5 1 8
13 3 1 2015-01-29 2 1 1
I would like to get the count number of the first outcome == 1 for each subscriber. Could you please tell me how I can get it? The final data set I would like is:
(Please note that some subscribers may not have any successful call; in that case, I would like to mark first_success as 0.)
subscriber_id first_success
1 1
2 5
3 1
...
library(dplyr)
data %>% group_by(subscriber_id) %>% filter(outcome==1) %>%
  slice(which.min(contact_date)) %>% ungroup() %>%
  select(subscriber_id, first_success = count)
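To also cover the note in the question, that subscribers with no outcome == 1 should get first_success = 0, one sketch summarises per subscriber instead of filtering (this assumes count increases with contact_date, as it does in the sample data):

```r
library(dplyr)

# toy data: subscriber 3 has no successful call
data <- data.frame(
  subscriber_id = c(1, 1, 2, 2, 2, 3),
  outcome       = c(1, 0, 0, 0, 1, 0),
  count         = c(1, 2, 1, 2, 3, 1)
)

first_success <- data %>%
  group_by(subscriber_id) %>%
  # smallest count among successes, or 0 if there is no success
  summarise(first_success = if (any(outcome == 1)) min(count[outcome == 1]) else 0L)
```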
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
So I have these two datasets:
ID DOB ID2 count
1 4083 2007-10-01 3625 5
2 4408 2008-07-01 3603 2
3 4514 2007-07-01 3077 3
4 4396 2008-05-01 3413 5
5 4222 2003-12-01 3341 1
6 4291 2000-07-01 3201 5
7 4581 2005-07-01 3836 1
8 4487 2007-01-01 3264 5
9 4916 2009-10-01 3825 1
10 4277 2000-04-01 3381 2
ID DOB score1 score2 score3 score4 score5 score6
4291 2000-07-01 2 5 2 2 1 2
4323 2000-07-01 3 3 1 4 2 5
4408 2008-07-01 4 2 5 5 3 5
4222 2003-12-01 2 1 3 2 3 3
4581 2005-07-01 5 1 5 2 3 1
4005 2003-06-01 1 4 2 4 5 3
4718 2009-02-01 2 3 1 5 5 5
4396 2008-05-01 3 5 2 2 2 5
4924 2008-02-01 5 5 4 5 5 4
4083 2007-10-01 4 5 1 3 3 4
4099 2000-05-01 4 3 1 2 1 2
4277 2000-04-01 2 2 1 3 1 1
4487 2007-01-01 2 5 2 4 3 5
4514 2007-07-01 1 3 4 3 1 5
4003 2005-07-01 3 3 4 1 1 3
4366 2008-12-01 4 4 4 4 3 4
4790 2009-07-01 1 3 1 3 1 4
4643 2002-03-01 3 2 3 3 4 3
4475 2009-05-01 1 4 3 3 3 3
4916 2009-10-01 5 1 3 1 2 2
Within dataset 2 there are the IDs and DOBs from dataset 1, along with rows for IDs of subjects I'm not interested in. What I would like to do is extract the IDs present in both datasets and create a dataset with the "ID2" column from dataset 1 and the other columns from dataset 2. Like so:
ID DOB ID2 score1 score2 score3 score4 score5 score6
4394 2004-11-01 3625 2 2 4 2 2 3
4181 2002-04-01 3603 3 1 3 2 2 5
4942 2001-08-01 3077 3 3 5 3 1 5
4765 2003-05-01 3413 1 5 5 2 3 2
4517 2003-03-01 3341 1 2 1 4 1 5
4905 2002-12-01 3201 5 2 4 1 1 5
4636 2004-07-01 3836 3 1 1 4 4 4
4179 2004-08-01 3264 5 2 5 5 4 2
4448 2007-11-01 3825 2 3 5 4 2 4
4218 2006-04-01 3381 1 5 3 4 5 3
I think the merge function comes into play here, but for the life of me I can't seem to get it to work, so any help you can give me will be gratefully received.
Does this answer your need?
merge(df1, df2, by = c("ID", "DOB"))
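An equivalent with dplyr, which keeps only the IDs present in both data frames (shown on a toy subset with made-up values for illustration):

```r
library(dplyr)

# toy subset: 4083 and 4408 appear in both, the others only in one
df1 <- data.frame(ID  = c(4083, 4408, 4999),
                  DOB = as.Date(c("2007-10-01", "2008-07-01", "2001-01-01")),
                  ID2 = c(3625, 3603, 3000))
df2 <- data.frame(ID  = c(4083, 4408, 4291),
                  DOB = as.Date(c("2007-10-01", "2008-07-01", "2000-07-01")),
                  score1 = c(4, 4, 2))

# rows matched on both ID and DOB, with columns from both sides
merged <- inner_join(df1, df2, by = c("ID", "DOB"))
```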