I have data which looks like this:
library(dplyr)
library(lubridate)
Date_Construct <- c("10/03/2018 00:00", "10/03/2018 00:00", "01/01/2016 00:00", "21/03/2015 01:25", "21/03/2015 01:25", "17/04/2016 00:00", "17/04/2016 00:00", "20/02/2012 00:00", "20/02/2020 00:00")
Date_first_use <- c("02/08/2018 00:00", "02/08/2018 00:00", "01/04/2016 00:00", NA, NA, NA, NA, "13/08/2012 00:00", "20/04/2020 00:00")
Date_fail <- c("02/08/2019 00:00", "02/08/2019 00:00", "21/06/2018 06:42", NA, NA, "17/04/2016 00:00", "17/04/2016 00:00", "13/08/2014 07:45", NA)
P_ID <- c("0001", "0001", "0001", "0001", "0001", "34000", "34000", "34000", "00425")
Comp_date <- c("16/05/2019 00:00", "10/04/2018 12:55", "25/06/2017 00:00", "22/04/2015 00:00", "08/05/2015 00:00", "04/05/2017 00:00", "15/07/2016 00:00", "01/03/2014 00:00", "20/03/2020 00:00")
Type <- c("a", "a", "b", "c", "c", "b", "b", "a", "c")

dfq <- data.frame(P_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date) %>%
  mutate(across(contains("Date", ignore.case = TRUE), dmy_hm)) %>%  # parse all date columns once, before sorting
  arrange(P_ID, desc(Date_Construct)) %>%
  group_by(P_ID, Date_Construct, Type) %>%
  mutate(A_ID = cur_group_id()) %>%
  select(P_ID, A_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date)
View(dfq)
It is a data frame of different items (A_ID) of type a/b/c, created for different clients (P_ID), with date of construction, date of first use and date of failure. Each P_ID may have multiple A_ID, and each A_ID may have multiple Comp_date.
Where Date_fail is NA, I need to supply a date: the Date_Construct of the next A_ID constructed for the same P_ID.
i.e. Date_fail for P_ID 0001, A_ID 1 should be 2016-01-01 00:00:00.
For an A_ID with no subsequent A_ID for the same P_ID (as is the case for P_ID 00425, A_ID 4), Date_fail should remain NA.
So result should look like:
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
I tried this, which I thought worked, but it just gives me the Date_Construct of the next row in the group, which isn't correct since some A_ID have multiple entries:
dfq %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(Date_fail2 = sort(Date_Construct, decreasing = FALSE)[row_number(Date_Construct) + 1]) %>%
  mutate(Date_fail = if_else(is.na(Date_fail), paste(Date_fail2), paste(Date_fail)))
I'm ideally looking for a dplyr solution as I find them easier to understand and reproduce.
One solution is to nest all the variables that can differ within the same A_ID (in this case only Comp_date):
library(tidyr)
nested <- dfq %>%
  ungroup() %>%
  arrange(P_ID, A_ID) %>%
  nest(extra = Comp_date)
This results in a tibble with one row for each A_ID, where the different Comp_dates are comfortably nested in their own tibbles:
> nested
# A tibble: 6 x 7
# Groups: P_ID, Type, Date_Construct [6]
P_ID A_ID Type Date_Construct Date_first_use Date_fail extra
<fct> <int> <fct> <dttm> <dttm> <dttm> <list>
1 0001 1 c 2015-03-21 01:25:00 NA NA <tibble [2 × 1]>
2 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 <tibble [1 × 1]>
3 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 <tibble [2 × 1]>
4 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA <tibble [1 × 1]>
5 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 <tibble [1 × 1]>
6 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 <tibble [2 × 1]>
You can now modify this using normal dplyr methods. Your own approach would probably work as well here, but it can be done much more cleanly using coalesce and lead. Don't forget to unnest at the end to get your original structure back:
result <- nested %>%
  group_by(P_ID) %>%
  mutate(Date_fail = coalesce(Date_fail, lead(Date_Construct))) %>%
  unnest(extra)
Result:
> result
# A tibble: 9 x 7
# Groups: P_ID [3]
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
<fct> <int> <fct> <dttm> <dttm> <dttm> <dttm>
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
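For completeness, here is a join-based sketch of the same idea that skips the nesting step. It is my own variation, not part of the answer above, and the helper name next_construct is made up: take the distinct construction date per A_ID, look up the next one within each P_ID with lead(), and coalesce it into Date_fail.
# Sketch only: assumes dfq as built above
next_construct <- dfq %>%
  ungroup() %>%
  distinct(P_ID, A_ID, Date_Construct) %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(next_Date_Construct = lead(Date_Construct)) %>%  # construction date of the next A_ID
  ungroup() %>%
  select(P_ID, A_ID, next_Date_Construct)

result2 <- dfq %>%
  ungroup() %>%
  left_join(next_construct, by = c("P_ID", "A_ID")) %>%
  mutate(Date_fail = coalesce(Date_fail, next_Date_Construct)) %>%
  select(-next_Date_Construct)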
Related
I'm trying to calculate a rolling-window count over a fixed time interval. Suppose the interval is 48 hours. I would like to get every data point that falls between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, I would like, for that row, a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df <- tibble(
  id = 1:7,
  date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
                     '2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00')
)
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
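Since the window needs to be adjustable, one option is to wrap the call in a small helper where the window length is an argument. This is just a sketch building on the slider answer above; the function name count_in_window and its arguments are my own, not part of slider:
# assumes dplyr, lubridate and slider are loaded as above
count_in_window <- function(data, window_hours = 48) {
  data %>%
    mutate(
      count = slide_index_int(
        .x = id,
        .i = date_time,
        .f = length,
        .before = dhours(window_hours)  # window length is now a parameter
      )
    )
}
df %>% count_in_window(window_hours = 24)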
How about this...
df %>%
  mutate(count48 = map_int(date_time,
                           ~ sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
Please forgive me, but I can't figure out how to create a tibble (I could only create a data.frame), so I'm pasting my data (I know, bad form).
# A tibble: 13 x 2
`Date CADD Rec'd` `CADD Completed`
<dttm> <dttm>
1 2015-01-20 00:00:00 2015-01-19 00:00:00
2 2015-01-29 00:00:00 2017-04-16 00:00:00
3 2016-12-21 00:00:00 2017-12-20 00:00:00
4 2017-01-03 00:00:00 2018-01-03 00:00:00
5 2017-01-03 00:00:00 2017-01-03 00:00:00
6 2021-08-23 00:00:00 NA
7 2021-08-23 00:00:00 2021-12-15 00:00:00
8 2021-08-23 00:00:00 2021-11-23 00:00:00
9 2021-12-25 00:00:00 2021-12-27 00:00:00
10 2022-01-02 00:00:00 NA
11 2022-01-02 00:00:00 NA
12 2022-01-03 00:00:00 2022-01-04 00:00:00
13 2022-01-07 00:00:00 NA
The desired output is: if the second column's date is before the first column's, set the first column equal to the second column; if it is later or NA, do not modify the first column. I can get it to run, but I'm thrown an error, "the condition has length > 1 and only the first element will be used". Does "the first element" refer to the first row, or to the first test in my two conditionals (the is.na() and < tests)? When I run this I get the desired change on the first row, but the NAs in the second column also behave unexpectedly: whether I use is.na(issue) or !is.na(issue), the NAs are forced onto the first column.
issue = "Date CADD Rec'd"
complete = "CADD Completed"
if (is.na(df[complete]) & (df[complete] < df[issue])) {
  df[issue] <- df[complete]
} else {
  df[issue] <- df[issue]
}
Resulting in this:
# A tibble: 13 x 2
`Date CADD Rec'd` `CADD Completed`
<dttm> <dttm>
1 2015-01-19 00:00:00 2015-01-19 00:00:00
2 2017-04-16 00:00:00 2017-04-16 00:00:00
3 2017-12-20 00:00:00 2017-12-20 00:00:00
4 2018-01-03 00:00:00 2018-01-03 00:00:00
5 2017-01-03 00:00:00 2017-01-03 00:00:00
6 NA NA
7 2021-12-15 00:00:00 2021-12-15 00:00:00
8 2021-11-23 00:00:00 2021-11-23 00:00:00
9 2021-12-27 00:00:00 2021-12-27 00:00:00
10 NA NA
11 NA NA
12 2022-01-04 00:00:00 2022-01-04 00:00:00
13 NA NA
I'm at a loss as to what's happening with how NA is handled, or perhaps I lack understanding of what the error means and how it applies. The result I expect is that only the first row will have the date in the first column changed, and the 4 NAs will not be forced into the first column (they'll keep their values since the second column is NA). I don't understand why I'm getting the opposite effect with is.na() either.
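A plain if() only looks at a single TRUE/FALSE value, which is why comparing whole columns triggers that condition-length message and why only the first row behaves as expected. A minimal vectorized sketch, assuming the tibble is named df as in the snippet above (this is my own suggestion, not an answer from the thread):
library(dplyr)
df <- df %>%
  mutate(`Date CADD Rec'd` = if_else(
    !is.na(`CADD Completed`) & `CADD Completed` < `Date CADD Rec'd`,
    `CADD Completed`,      # completed before received: take the completed date
    `Date CADD Rec'd`      # later or NA: leave the received date as-is
  ))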
The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Datetime = ymd_hms(c("2016-08-21 00:00:00", "2016-08-24 08:00:00", "2016-08-23 12:00:00",
                       "2016-08-29 03:00:00", "2016-08-27 23:00:00", "2016-09-02 02:00:00",
                       "2016-09-01 12:00:00", "2016-09-09 04:00:00", "2016-09-01 12:00:00",
                       "2016-09-10 12:00:00"))
)
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate, for each row, the number of hours (Hours_since_beginning) since the first time that individual was detected.
I would expect something like this (it may contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_beginning
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)

# first get the min datetime by ID
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))

# join with df1 and compute the time difference in hours
df1 <- df1 %>%
  left_join(min_datetime_id, by = "ID") %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
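The same result can be obtained without the intermediate lookup table by computing the group minimum inside a grouped mutate(). An equivalent sketch, assuming df1 as above:
library(dplyr)
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()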
I've got a fairly straight-forward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly timeseries dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
  group = 1:12,
  start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
  end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
  merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
  group = c(1, 4, 7, 9),
  start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
  end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant, I only care about the start and end columns from the result.
We can use dplyr and create a temporary grouping variable that starts a new group whenever the previous row's merge_with_next is FALSE, then take the first group and start values and the last end value within each of those groups.
library(dplyr)
df %>%
  group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  summarise(group = first(group),
            start = first(start),
            end = last(end)) %>%
  ungroup() %>%
  select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
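If the cumsum(!lag(...)) trick looks opaque, it helps to inspect the helper column on its own before summarising. This is just the same expression pulled out for a quick check, not a new step:
df %>%
  mutate(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  select(group, merge_with_next, temp)
Within each run of rows ending in FALSE, temp stays constant (0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3), which is exactly the grouping summarise() needs.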
I have a data object similar to the following:
> temp2 %>% arrange(date_val) %>% select(date_val,kpi_name,kpi_value)
# Source: spark<?> [?? x 3]
# Ordered by: date_val
date_val kpi_name kpi_value
<dttm> <chr> <dbl>
1 2018-12-04 00:00:00 KPI1 0
2 2018-12-04 00:00:00 KPI2 38
3 2018-12-04 00:01:00 KPI2 55
4 2018-12-04 00:01:00 KPI1 1
5 2018-12-04 00:02:00 KPI2 55
6 2018-12-04 00:02:00 KPI1 1
7 2018-12-04 00:03:00 KPI1 0
8 2018-12-04 00:03:00 KPI2 58
9 2018-12-04 00:04:00 KPI2 45
10 2018-12-04 00:04:00 KPI1 1
# … with more rows
I would like to insert, for each date_val group, a new row computed from the kpi_name/kpi_value pairs already present for that date_val. For example, let's say I need to calculate a new KPI3 as 100*(KPI1/KPI2), which would give a new data object such as:
# Source: spark<?> [?? x 3]
# Ordered by: date_val
date_val kpi_name kpi_value
<dttm> <chr> <dbl>
1 2018-12-04 00:00:00 KPI1 0
2 2018-12-04 00:00:00 KPI2 38
3 2018-12-04 00:00:00 KPI3 0
4 2018-12-04 00:01:00 KPI2 55
5 2018-12-04 00:01:00 KPI1 1
6 2018-12-04 00:01:00 KPI3 0.018
7 2018-12-04 00:02:00 KPI2 55
8 2018-12-04 00:02:00 KPI1 1
9 2018-12-04 00:02:00 KPI3 0.018
10 2018-12-04 00:03:00 KPI1 0
11 2018-12-04 00:03:00 KPI2 58
12 2018-12-04 00:03:00 KPI3 0
13 2018-12-04 00:04:00 KPI2 45
14 2018-12-04 00:04:00 KPI1 1
15 2018-12-04 00:04:00 KPI3 0.022
# … with more rows
Can this be done in dplyr?
This should do it:
library(tidyverse)
temp2 %>%
  spread(kpi_name, kpi_value) %>%
  mutate(KPI3 = 100 * (KPI1 / KPI2)) %>%
  gather(kpi_name, kpi_value, -date_val)
While it's technically possible to rbind in new rows, it's comparatively inefficient and syntactically clunky. It makes much more sense to transform to the logical wide format, add the column, and then transform back.
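spread() and gather() still work but are superseded in newer tidyr; a sketch of the same pipeline with the pivot_* verbs is below. It assumes temp2 as above, and since temp2 is a Spark table, whether sparklyr translates pivot_wider()/pivot_longer() depends on your version, so treat it as untested on that backend:
library(dplyr)
library(tidyr)
temp2 %>%
  pivot_wider(names_from = kpi_name, values_from = kpi_value) %>%  # one column per KPI
  mutate(KPI3 = 100 * (KPI1 / KPI2)) %>%
  pivot_longer(-date_val, names_to = "kpi_name", values_to = "kpi_value")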