How to do a floor_date() in dbplyr - r

I'm trying to aggregate minute-level time series data to hourly level via averaging.
In order to do that I want to calculate an hour column that has the day and hour that the reading occurred in. Then I can do a simple group_by summarise. For instance, my tbl_df looks like:
# Database: Microsoft SQL Server 13.00.4001[<SERVER>/<Project>]
eGauge time Channel End_Use Metric Circuit Reading mean_lag
<int> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 30739 2018-07-06 20:04:00.000 8.0 Clothes Washer P <NA> 0.000033333 60
2 30739 2018-07-06 20:13:00.000 3.0 Clothes Dryer P <NA> 0.000833333 60
3 30739 2018-07-06 21:16:00.000 6.0 Cooktop P <NA> 0.000050000 60
4 30739 2018-07-06 21:00:00.000 3.0 Clothes Dryer P <NA> 0.000833333 60
5 30739 2018-07-06 21:46:00.000 8.0 Clothes Washer P <NA> 0.000016667 60
6 30739 2018-07-07 02:06:00.000 3.0 Clothes Dryer P <NA> 0.001016667 1
7 30739 2018-07-07 08:52:00.000 1.0 Service Mains P <NA> 1.814516667 1
8 30739 2018-07-07 08:52:00.000 3.0 Clothes Dryer P <NA> 0.001050000 1
9 30739 2018-07-07 08:52:00.000 4.0 Central AC P <NA> 0.043000000 1
10 30739 2018-07-07 08:52:00.000 5.0 Oven P <NA> 0.021333333 1
and I would like a new column like this: 2018-07-06 20:00:00.000 or 2018-07-06 20:00:00.000.
Normally I would use floor_date(time, "hour") from lubridate, or even str_replace(time, ".{2}(?=:[^:]*$)", "00"), but neither are working for me with my SQL Server connection.
Any idea how this is done in R? Answer must R code and preferrably be dplyr code such as:
# NOT WORKING
my_table %>%
mutate(time_hour = floor_date(time, "hour"))
or
# NOT WORKING
my_table %>%
mutate(time_hour = DATEADD('hour', DATEDIFF('hour', 0, time), 0))

my_table %>%
mutate(time_hour = DATEADD(sql("hour"), DATEDIFF(sql("hour"), 0, time), 0))

WORKS BUT NEEDS IMPROVEMENT
my_table %>%
mutate(hour = "hour",
time_hour = DATEADD(hour, DATEDIFF(hour, 0, time), 0)) %>%
select(-hour)

Related

Remove duplicates based on multiple conditions

I have some individuals that are listed twice because they have received numerous degrees. I am trying to only get the rows with the latest degree granting date. Below are examples of the current output and the desired output
people | g_date | wage|quarter
personA|2009-01-01|100 |20201
personA|2009-01-01|100 |20202
personA|2010-01-01|100 |20201
personA|2010-01-01|100 |20202
personB|2012-01-01|50 |20201
personB|2012-01-01|50 |20202
personB|2012-01-01|50 |20203
Desired output
people | g_date | wage|quarter
personA|2010-01-01|100 |20201
personA|2010-01-01|100 |20202
personB|2012-01-01|50 |20201
personB|2012-01-01|50 |20202
personB|2012-01-01|50 |20203
I have used the code that is below but it is removing all the rows so that there is only one row per person.
df<-df[order(df$g_date),]
df<-df[!duplicated(df$people, fromLast = TRUE),]
Another option using group_by with ordered slice_max like this:
library(dplyr)
df %>%
group_by(people, quarter) %>%
slice_max(order_by = g_date, n = 1)
#> # A tibble: 5 × 4
#> # Groups: people, quarter [5]
#> people g_date wage quarter
#> <chr> <chr> <dbl> <int>
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203
Created on 2022-12-15 with reprex v2.0.2
merge(df, aggregate(. ~ people, df[1:2], max))
#> people g_date wage quarter
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203
Update (thanks to #Villalba, removed first answer):
We colud first group arrange and then filter:
library(dplyr)
library(lubridate)
df %>%
group_by(people, quarter) %>%
mutate(g_date = ymd(g_date)) %>%
arrange(g_date, .by_group = TRUE) %>%
filter(row_number()==n())
people g_date wage quarter
<chr> <date> <int> <int>
1 personA 2010-01-01 100 20201
2 personA 2010-01-01 100 20202
3 personB 2012-01-01 50 20201
4 personB 2012-01-01 50 20202
5 personB 2012-01-01 50 20203

Comparing Dates Across Multiple Variables

I'm attempting to figure out the amount of days in between games and if that has an impact on wins/losses, this is the information I'm starting with:
schedule:
Home
Away
Home_Final
Away_Final
Date
DAL
OAK
30
35
9/1/2015
KC
PHI
21
28
9/2/2015
This is the result I'd like to get:
Home
Away
Home_Final
Away_Final
Date
Home_Rest
Away_Rest
Adv
Adv_Days
Adv_Won
DAL
OAK
30
35
9/1/2015
null
null
null
null
null
KC
PHI
21
28
9/2/2015
null
null
null
null
null
DAL
PHI
28
7
9/9/2015
8
7
1
1
1
OAK
KC
14
21
9/9/2015
8
7
1
1
0
'Home_Rest' = The home teams amount of days between their games
'Away Rest' = The away teams amount of days between their games
'Adv' = True/False that there was an advantage on one side
'Adv_Days' = The amount of advantage in days
'Adv_Won' = The side with the advantage won
Here is what I've tried, I was able to get it to count how many days were between games for one team, but when I bring all the other ones in I can't wrap my head around how to do that.
library(tidyverse)
library(lubridate)
team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))
You can extend this to all teams using a grouped mutate. See docs for group_by() here.
Something like
schedule %>%
group_by(vars_to_group_by) %>%
mutate(new_var = expr_to_calculate_new_var)
In future, it would be helpful if you included code to recreate a minimal dataset for your example.
The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. Because the Date applies to both teams, that is, one item applies to two columns in the dataframe, which makes it difficult to give it a uniform treatment.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tidyr::tribble(
~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
"DAL", "OAK", 30, 35, "9/1/2015",
"KC", "PHI", 21, 28, "9/2/2015",
"DAL", "PHI", 28, 7, "9/9/2015",
"OAK", "KC", 14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
rbind(
schedule %>%
transmute(
id,
Team = Home,
Role = "Home",
Final = Home_Final,
Date
),
schedule %>%
transmute(
id,
Team = Away,
Role = "Away",
Final = Away_Final,
Date
)
)
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
schedule_relational %>%
group_by(Team) %>%
arrange(Date) %>%
mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
schedule_relational %>%
pivot_wider(
names_from = Role,
values_from = c(Team, Final, Rest),
names_glue = "{Role}_{.value}"
)
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
result %>%
transmute(
Home = Home_Team,
Away = Away_Team,
Home_Final,
Away_Final,
Date,
Home_Rest,
Away_Rest,
Adv = as.integer(Home_Rest != Away_Rest),
Adv_Days = abs(Home_Rest != Away_Rest),
Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
)
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if instead of reducing Adv and Adv_Won to yes/no (discrete) values, you'd track the number of days of rest and difference in score. Therefore you could correlate the results also in terms of magnitude.
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be more convoluted solutions, but this is very clear to read and understand.

R Filter by sequence of events

I have an event log data. For reproducible example, let's use the data from eventdataR
eventdataR::patients
## look at patient 1 sequence
eventdataR::patients %>% dplyr::filter(patient == '1')
# A tibble: 12 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 1 r1 1 start 2017-01-02 11:41:53 1
2 Triage and A~ 1 r2 501 start 2017-01-02 12:40:20 2
3 Blood test 1 r3 1001 start 2017-01-05 08:59:04 3
4 MRI SCAN 1 r4 1238 start 2017-01-05 21:37:12 4
5 Discuss Resu~ 1 r6 1735 start 2017-01-07 07:57:49 5
6 Check-out 1 r7 2230 start 2017-01-09 17:09:43 6
7 Registration 1 r1 1 complete 2017-01-02 12:40:20 7
8 Triage and A~ 1 r2 501 complete 2017-01-02 22:32:25 8
9 Blood test 1 r3 1001 complete 2017-01-05 14:34:27 9
10 MRI SCAN 1 r4 1238 complete 2017-01-06 01:54:23 10
11 Discuss Resu~ 1 r6 1735 complete 2017-01-07 10:18:08 11
12 Check-out 1 r7 2230 complete 2017-01-09 19:45:45 12
In the above example, we can see the sequence of handling for patient 1 over a period of time. We can imagine that different patients would have different sequences or went through different number of sequences.
Now let's say I'm interested in a specific sequence and want to know which patients had gone through this specific sequence. How can I filter this dataset by this specific sequence so that I can get to know who these patients are?
The filter_activity_presence from edeaR library can help me with identifying the unique sequences and its frequency
patients %>% traces
# A tibble: 7 x 3
trace absolute_frequen~ relative_frequen~
<chr> <int> <dbl>
1 Registration,Triage and Assessment,X-Ray,Discuss R~ 258 0.516
2 Registration,Triage and Assessment,Blood test,MRI ~ 234 0.468
3 Registration,Triage and Assessment,Blood test,MRI ~ 2 0.004
4 Registration,Triage and Assessment,X-Ray 2 0.004
5 Registration,Triage and Assessment 2 0.004
6 Registration,Triage and Assessment,X-Ray,Discuss R~ 1 0.002
7 Registration,Triage and Assessment,Blood test 1 0.002
Let's say I'm interested in sequence from row 5, that is patients who had exclusively this sequence Registration -> Triage -> Assessment, there are 2 patients who had this sequence.
It seems the library that doesn't provide ready made function to extract this. At least from this doc page, https://www.bupar.net/subsetting.html#trace_length, it's not available.
Basically, given an exhaustive list of sequence, return all the patients who had gone through exactly this sequence.
In fact, if I can rebuild the trace and map it back to the original dataset, that should allow for a simple dplyr::filter. But this may not be ideal as well in the case if I'm interested in open ended sequence, for example, find all patients who started with Registration -> Triage and can be followed by any sequence.
Here's my long-winded attempt
# get trace for each patient
patient_trace <- as_tibble(patients) %>% group_by(patient) %>% dplyr::filter(registration_type == 'complete') %>%
summarise(trace = paste(handling, collapse = ","), n = n())
# identify the sequence trace of interest
trace_summary <- patients %>% traces
# here we want to see patients who had the sequence from row 5
res <- patients %>%
dplyr::filter(patient %in% c(patient_trace %>% dplyr::filter(trace %in% trace_summary$trace[5]) %>% .$patient)) %>%
dplyr::filter(registration_type == 'complete') %>%
arrange(patient, time)
# A tibble: 4 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 499 r1 499 complete 2018-05-01 22:57:38 1
2 Triage and As~ 499 r2 999 complete 2018-05-04 23:53:27 3
3 Registration 500 r1 500 complete 2018-05-02 01:28:23 2
4 Triage and As~ 500 r2 1000 complete 2018-05-05 07:16:02 4
You can filter them with dplyr :
library(dplyr)
req_sequence <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
group_by(patient) %>%
filter(all(handling == req_sequence)) %>%
filter(registration_type == 'complete') %>%
ungroup
# handling patient employee handling_id registration_type time .order
# <fct> <chr> <fct> <chr> <fct> <dttm> <int>
#1 Registration 499 r1 499 complete 2018-05-01 22:57:38 3220
#2 Registration 500 r1 500 complete 2018-05-02 01:28:23 3221
#3 Triage and Assessment 499 r2 999 complete 2018-05-04 23:53:27 3720
#4 Triage and Assessment 500 r2 1000 complete 2018-05-05 07:16:02 3721
For this case to be sure of the output and to avoid any recycling effect we can filter registration_type == 'complete' first and also add another check of length(req_sequence) equal to number of rows for the patient id.
eventdataR::patients %>%
filter(registration_type == 'complete') %>%
group_by(patient) %>%
filter(length(req_sequence) == n() && all(handling == req_sequence)) %>%
ungroup

Time difference calculated from wide data with missing rows

There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows
df1<-data.frame("id"=c(1:4),
"adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
"bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
"cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
"ddate"=c("2014-06-15",NA,"2014-04-11",NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse solution can be done by converting the data to long format, removing NA dates, and getting the time difference between last and first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
pivot_longer(-id) %>%
na.omit %>%
group_by(id) %>%
mutate(value = as.Date(value)) %>%
summarise(years = time_length(difftime(last(value), first(value)),"years"),
days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(out = pmap(.[-1], ~ {
dates <- as.Date(na.omit(c(...)))
tibble(years = lubridate::time_length(difftime(last(dates),
first(dates)), "years"),
days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
})) %>%
unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Probably most of the functions introduced here might be quite complex. You should try to learn them if possible. Although will provide a Base R approach:
grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping:
days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) #Get the difference
cbind(df[1],days, years = round(days/365,2)) # Create your table
id days years
1.1 1 1093 2.99
2.2 2 731 2.00
3.3 3 1098 3.01
4.4 4 369 1.01
if comfortable with other higher functions then you could do:
dat <- aggregate(adate~id,reshape(df1,list(2:ncol(df1)), dir="long"),function(x)max(x) - x[1])
transform(dat,year = round(adate/365,2))
id adate year
1 1 1093 2.99
2 2 731 2.00
3 3 1098 3.01
4 4 369 1.01
Using base R apply :
df1[-1] <- lapply(df1[-1], as.Date)
df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
x <- na.omit(x)
x1 <- difftime(x[length(x)], x[1], 'days')
c(x1/365, x1)
}))
df1[c('id', 'years', 'days')]
# id years days
#1 1 2.994521 1093
#2 2 2.002740 731
#3 3 3.008219 1098
#4 4 1.010959 369

time differences for multiple events for same ID in R

I'm new to Stackoverflow and looked at similar posts but couldn't find a solution that can capture time differences from multiple events from the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The helper column is the StoreID and Unit combined because I couldn't figure out how to group by both Store ID and the Unit. I want to sort the data to show when the unit was disabled (value =0) and enabled again (value =1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2017-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Header column that combines the Value column and the row number that distinguishes duplicates, then spread to wide format:
Didn't use the helper column, grouped by StoredID and Unit instead.
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
group_by(StoreID, Unit, Value) %>%
mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
ungroup() %>% select(-Value) %>%
spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA

Resources