I have a data frame in the below format and I'm trying to find the difference in time between the Event 'ASSIGNED' and the last time the Event is 'CREATED' that comes before it.
**AccountID** **TIME** **EVENT**
1 2016-11-08T01:54:15.000Z CREATED
1 2016-11-09T01:54:15.000Z ASSIGNED
1 2016-11-10T01:54:15.000Z CREATED
1 2016-11-11T01:54:15.000Z CALLED
1 2016-11-12T01:54:15.000Z ASSIGNED
1 2016-11-12T01:54:15.000Z SLEEP
Currently my code is as follows, my difficulty is selecting the CREATED that just comes before the ASSIGNED Event
test <- timetable.filter %>%
group_by(AccountID) %>%
mutate(timeToAssign = ifelse(EVENT == 'ASSIGNED',
interval(ymd_hms(TIME), max(ymd_hms(TIME[EVENT == 'CREATED']))) %/% hours(1), NA))
I'm looking for the output to be
**AccountID** **TIME** **EVENT** **timeToAssign**
1 2016-11-08T01:54:15.000Z CREATED NA
1 2016-11-09T01:54:15.000Z ASSIGNED 12
1 2016-11-10T01:54:15.000Z CREATED NA
1 2016-11-11T01:54:15.000Z CALLED NA
1 2016-11-12T01:54:15.000Z ASSIGNED 24
1 2016-11-12T01:54:15.000Z SLEEP NA
With dplyr and tidyr:
library(dplyr); library(tidyr); library(anytime)
df %>%
group_by(AccountID) %>%
mutate(CREATED_INDEX = if_else(EVENT == 'CREATED', row_number(), NA_integer_),
TIME = anytime(TIME)) %>%
fill(CREATED_INDEX) %>%
mutate(TimeToAssign = if_else(EVENT == 'ASSIGNED',
as.numeric(TIME - TIME[CREATED_INDEX], units = 'hours'),
NA_real_)) %>%
select(-CREATED_INDEX)
# A tibble: 6 x 4
# Groups: AccountID [1]
# AccountID TIME EVENT TimeToAssign
# <int> <dttm> <fctr> <dbl>
#1 1 2016-11-08 01:54:15 CREATED NA
#2 1 2016-11-09 01:54:15 ASSIGNED 24
#3 1 2016-11-10 01:54:15 CREATED NA
#4 1 2016-11-11 01:54:15 CALLED NA
#5 1 2016-11-12 01:54:15 ASSIGNED 48
#6 1 2016-11-12 01:54:15 SLEEP NA
Related
Thank you, experts for previous answers (How to filter by range of dates in R?)
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm]yyyy]: (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations it must be deleted . (b) If there are less than 3 this one must remain.
My expected result is:
p q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
mutate(day = dmy(data))%>%
group_by(p) %>%
arrange(day, .by_group = TRUE) %>%
mutate(diff = day - first(day)) %>%
mutate(row = row_number()) %>%
filter(row <= 3 | !diff < 30)
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straighforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
I have a big dataset of about 4 Milion rows.
the columns are
Idx - dog serial number
date - date of event YYYY-MM-DD ( 2016 till 2021)
Is_sterilized - 1 if the dog was sterilized and 0 if not sterilized.
each dog can appear many times in a year,
It can appear in 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year, meaning, if a dog change from Is_serilized==0 to Is_sterilized ==1 in a year I count it as sterilized that year, the first year it appears sterilized counted as his year fo sterilization.
The issue is that my database is not clean and for some dogs goes from sterilized to not sterilized, this can not happen since sterilization is one-way ticket surgery.
It can happen that a dog appears sterilized, 3 years consecutive and then one year by mistake unsterilized and then sterilized for 2 years.
What I'm asking is if there is a logic that I can estimate/count how many dogs having the wrong direction.
And if so, how can I deduce those dogs from my dataset?
In the example data, Idx = A and C make sense but B and D does not make senese
df_test <- data.frame(Idx=c( 'A', 'B', 'B', 'B','A', 'A', 'C', 'C', 'D','D','D','D','D','D','C', 'C','A' ),
YEAR_date=as.Date(c("2016-01-01","2016-01-29","2017-01-01","2016-05-01","2016-05-06","2016-05-01","2016-03-03","2016-04-22","2018-05-05", "2017-02-01"," 2021-11-12"," 2019-09-13"," 2019-11-12"," 2019-08-17", "2011-09-01"," 2011-07-05","2021-01-05")),
Is_sterilized =c(0,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1)
)
df_test[,c( "Idx" ,"YEAR_date", "Is_sterilized")] %>% arrange(Idx ,YEAR_date )
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns is if you thing anything else is relevant please write and I'll check I have it.
Any hint idea anything will be helpul
Thanks You in advance
Here's some dplyr code to identify instances where a dog's sterilization went from 1 to zero:
library(dplyr)
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup()
# A tibble: 3 x 4
Idx YEAR_date Is_sterilized change
<chr> <date> <dbl> <dbl>
1 B 2017-01-01 0 -1
2 D 2021-11-12 0 -1
3 D 2019-11-12 0 -1
If you want to count the number of dogs in that list, add %>% count(Idx) at the end.
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup() %>%
count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
Idx times_desterilized
<chr> <int>
1 B 1
2 D 2
If I had:
person_ID visit date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert the 'visit_date' to Date class, grouped by 'person_ID', create a binary column that returns 1 if the difference between the current and next visit_date is less than 90 or else 0, using this column, get the correponding next visit_date' where the value is 1
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
I have a dataframe structured like so:
example <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
delivereddate = c("7/20/2019","7/24/2019","7/28/2019","3/24/2019","4/13/2019","4/25/2019","11/13/2019","11/20/2019","11/27/2019"),
applieddate = c("7/22/2019","7/22/2019","7/22/2019",NA,NA,NA,"11/21/2019","11/21/2019","11/21/2019"))
I am attempting to add a column that identifies the most recent deliverdate before the applieddate for each id. An example of what I'm trying to get for the final result is the following:
desiredresult <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
delivereddate = c("7/20/2019","7/24/2019","7/28/2019","3/24/2019","4/13/2019","4/25/2019","11/13/2019","11/20/2019","11/27/2019"),
applieddate = c("7/22/2019","7/22/2019","7/22/2019",NA,NA,NA,"11/21/2019","11/21/2019","11/21/2019"),
applied = c(1,0,0,0,0,0,0,1,0))
I need the applied column to be binary(0 or 1) and there can only be 1 row per id with a 1 flag. If an id has no applieddate, then the applied flag is 0 for all rows.
We could use findInterval
library(dplyr)
library(lubridate)
example %>%
dplyr::group_by(id) %>%
dplyr::mutate(applied = +(row_number() %in%
findInterval(lubridate::mdy(first(applieddate)),
lubridate::mdy(delivereddate))))
# A tibble: 9 x 4
# Groups: id [3]
# id delivereddate applieddate applied
# <dbl> <chr> <chr> <int>
#1 1 7/20/2019 7/22/2019 1
#2 1 7/24/2019 7/22/2019 0
#3 1 7/28/2019 7/22/2019 0
#4 2 3/24/2019 <NA> 0
#5 2 4/13/2019 <NA> 0
#6 2 4/25/2019 <NA> 0
#7 3 11/13/2019 11/21/2019 0
#8 3 11/20/2019 11/21/2019 1
#9 3 11/27/2019 11/21/2019 0
You can convert the columns to date class, subtract applieddate from delivereddate and take the absolute value. For each id we then assign 1 to index where minimum difference is observed.
library(dplyr)
example %>%
mutate(across(ends_with('date'), lubridate::mdy),
applied = abs(delivereddate - applieddate)) %>%
group_by(id) %>%
mutate(applied = +(row_number() %in% which.min(applied)))
# id delivereddate applieddate applied
# <dbl> <date> <date> <int>
#1 1 2019-07-20 2019-07-22 1
#2 1 2019-07-24 2019-07-22 0
#3 1 2019-07-28 2019-07-22 0
#4 2 2019-03-24 NA 0
#5 2 2019-04-13 NA 0
#6 2 2019-04-25 NA 0
#7 3 2019-11-13 2019-11-21 0
#8 3 2019-11-20 2019-11-21 1
#9 3 2019-11-27 2019-11-21 0
I have a data in which I have 2 fields in a table sf -> Customer id and Buy_date. Buy_date is unique but for each customer, but there can be more than 3 different values of Buy_dates for each customer. I want to calculate difference in consecutive Buy_date for each Customer and its mean value. How can I do this.
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or as #r2evans points out in his comment for the consecutive days between Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output but this what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)