R - Condensing rows based on continuity of start and end timestamps

I have some IDs with timestamps that are exhibiting behavior that is out of my control. This is causing new rows for the same ID, but the end_time for the first row of a given ID matches the start_time of the next row for that same ID. An example looks like this:
df <- data.frame(id = c("1", "1", "1", "2", "3", "3"),
start_time = c("7/4/2020 10:06:27", "7/16/2020 07:16:44", "7/16/2020 07:20:32", "7/9/2020 03:27:37", "7/4/2020 02:01:49", "7/9/2020 00:00:00"),
end_time = c("7/16/2020 07:16:44", "7/16/2020 07:20:32", "7/25/2020 18:17:46", "7/21/2020 20:13:16", "7/5/2020 09:17:54", "7/11/2020 15:43:22"))
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/16/2020 07:16:44
2 1 7/16/2020 07:16:44 7/16/2020 07:20:32
3 1 7/16/2020 07:20:32 7/25/2020 18:17:46
4 2 7/9/2020 03:27:37 7/21/2020 20:13:16
5 3 7/4/2020 02:01:49 7/5/2020 09:17:54
6 3 7/9/2020 00:00:00 7/11/2020 15:43:22
But I want to 'stitch' together the timestamps for id = 1 (and any other rows exhibiting this behavior) to form one distinct row like below. Note how the end_time and start_time entries for this id match up exactly.
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/25/2020 18:17:46
2 2 7/9/2020 03:27:37 7/21/2020 20:13:16
3 3 7/4/2020 02:01:49 7/5/2020 09:17:54
4 3 7/9/2020 00:00:00 7/11/2020 15:43:22
The entries for id = 3 are left alone because the end_time of the first instance is distinct from the start_time of the second instance, so the desired behavior is for those rows to remain unchanged.

Here's a dplyr solution:
library(dplyr)
df %>%
  mutate_at(vars(start_time, end_time), ~ as.POSIXct(., format = "%m/%d/%Y %T")) %>%
  arrange(id, start_time) %>%
  group_by(id) %>%
  mutate(grp = cumsum(start_time != dplyr::lag(end_time, default = end_time[1]))) %>%
  group_by(id, grp) %>%
  summarize(start_time = start_time[1], end_time = end_time[n()]) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 4 x 3
# id start_time end_time
# <chr> <dttm> <dttm>
# 1 1 2020-07-04 10:06:27 2020-07-25 18:17:46
# 2 2 2020-07-09 03:27:37 2020-07-21 20:13:16
# 3 3 2020-07-04 02:01:49 2020-07-05 09:17:54
# 4 3 2020-07-09 00:00:00 2020-07-11 15:43:22
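To see what the cumsum() trick produces, here are the grp values for this sample (worked out by hand from the data above; the default = end_time[1] in lag() is only there so the first row of each id compares against something other than NA):
# id 1: rows 2 and 3 start exactly when the previous row ends, so grp = 1, 1, 1
# id 2: one row, grp = 1
# id 3: 7/9/2020 00:00:00 does not equal 7/5/2020 09:17:54, so grp = 1, 2
summarize() then collapses each (id, grp) pair into a single stitched row.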
Edited to add as.POSIXct. One comment was to sort by start_time to make sure the comparisons line up, but I realized that since those columns were originally strings, there was no guarantee they would sort correctly. For instance, sort(c("7/31/2020", "12/01/2020")) sorts incorrectly. Unlikely to matter with this data, but still a good safeguard.
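A minimal illustration of that pitfall (toy values, not from the data above):
sort(c("7/31/2020", "12/01/2020"))
# [1] "12/01/2020" "7/31/2020"    <- alphabetical, not chronological
sort(as.Date(c("7/31/2020", "12/01/2020"), format = "%m/%d/%Y"))
# [1] "2020-07-31" "2020-12-01"   <- chronological once parsed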

Related

R create week numbers with specified start date

This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins ~ August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
Solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always occur on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
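On the unit-conversion question above: one way to avoid the secs_per_week constant is to ask difftime for days explicitly and floor-divide instead of rounding. A sketch along the same lines (not from the original answers, so treat it as an untested variant):
eg_data %>%
  group_by(cycle) %>%
  mutate(nw = 1 + as.numeric(difftime(dates, min(dates), units = "days")) %/% 7)
With floor division the 11-day gap in cycle2017 lands in week 2, and this reproduces the week_n column for all eight rows of eg_data.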
If someone has a better answer please post it, but this works:
Take the dataframe in the example, eg_data -
eg_data %>%
  group_by(cycle) %>%
  mutate(cycle_start = as.Date(min(dates)),
         days_diff = as.Date(dates) - cycle_start,
         week_n = days_diff / 7,
         week_n_whole = ceiling(days_diff / 7)) -> eg_data_check
(First time I've answered my own question)
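One caveat with the ceiling() version, worth checking against your data: on the cycle start date itself days_diff is 0, so week_n_whole comes out as 0 rather than 1; floor(days_diff / 7) + 1 would put the start date in week 1, which is essentially what the lubridate answer at the end of this thread does.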
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)   # str_sub() comes from stringr
eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1)/ddays(7)))
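To make the construction concrete: for cycle2019, str_sub(cycle, start = -4) pulls out "2019", so aug1 is ymd_h("2019 080100"), i.e. 2019-08-01 00:00:00, and week_n2 then counts whole weeks (rounded up) since that August 1st.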
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4

Grouping temporally arranged events across sites with dplyr

I'm working with an ecological dataset that has multiple individuals moving across a landscape where they can be detected at multiple sites. The data has a beginning and ending timestamp for when an individual was detected at a given site; from here on we'll call this time window for an individual at a site an "event". These events are the rows in this data. I sorted this data by time and noticed I can have multiple events while an individual remains at a given site (which can be due to an individual moving away from the receiver and coming back to it while not being detected at an adjacent receiver).
Here's example data for a single individual, x:
input <- data.frame(individual = c("x","x","x","x","x","x","x"),
site = c("a","a","a","b","b","a", "a"),
start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-01-14 11:13:10", "2020-01-14 11:16:20",
"2020-02-14 11:11:11", "2020-02-14 11:13:10",
"2020-03-14 11:12:11", "2020-03-15 11:12:11")),
end_time = as.POSIXct(c("2020-01-14 11:11:41", "2020-01-14 11:13:27", "2020-01-14 11:16:50",
"2020-02-14 11:13:11", "2020-02-14 11:15:10",
"2020-03-14 11:20:11", "2020-03-15 11:20:11")))
I want to aggregate these smaller events (e.g. the first 3 events at site a) into one larger event where I summarize the start/end times for the whole event:
output <- data.frame(individual = c("x","x","x"), site = c("a", "b", "a"),
start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-02-14 11:11:11", "2020-03-14 11:12:11")),
end_time = as.POSIXct(c("2020-01-14 11:16:50", "2020-02-14 11:15:10", "2020-03-15 11:20:11")))
Note that time intervals for events vary.
Using group_by(individual, site) would mean losing this temporal info, since individuals can travel among sites multiple times. I thought about using some sort of helper dataframe that summarizes events for individuals at sites, but I am not sure how to retain the temporal info. I suppose there is a way to do this by indexing row numbers/looping in base R, but I am hoping there is a nifty dplyr trick that can help with this problem.
One approach would be to take the cumulative sum of times that site has changed, and use that count to summarize each individual's contiguous times at one site.
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time))
# A tibble: 3 x 5
# Groups: individual, site [2]
individual site indiv_new_site start_time end_time
<chr> <chr> <int> <dttm> <dttm>
1 x a 1 2020-01-14 11:11:11 2020-01-14 11:16:50
2 x a 3 2020-03-14 11:12:11 2020-03-15 11:20:11
3 x b 2 2020-02-14 11:11:11 2020-02-14 11:15:10
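If you want the result in time order and without the helper column, appending ungroup() %>% arrange(start_time) %>% select(-indiv_new_site) to the pipe above (a small optional follow-up) does that.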
We could use rle from base R
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site,
           grp = with(rle(site), rep(seq_along(values), lengths))) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = 'drop') %>%
  select(-grp)
Output:
# A tibble: 3 x 4
# individual site start_time end_time
# <chr> <chr> <dttm> <dttm>
#1 x a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2 x a 2020-03-14 11:12:11 2020-03-15 11:20:11
#3 x b 2020-02-14 11:11:11 2020-02-14 11:15:10
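For the sorted sites (a a a b b a a), with(rle(site), rep(seq_along(values), lengths)) evaluates to 1 1 1 2 2 3 3: a run id that ticks up each time the site changes, so the two separate visits to site a stay in separate groups.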
In data.table we can use rleid.
library(data.table)
setDT(input)
input[, .(site = first(site),
start_time = min(start_time),
end_time = max(end_time)), .(individual, rleid(site))]
# individual rleid site start_time end_time
#1: x 1 a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2: x 2 b 2020-02-14 11:11:11 2020-02-14 11:15:10
#3: x 3 a 2020-03-14 11:12:11 2020-03-15 11:20:11
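If you don't want the run counter in the result, one option (a sketch on the same call, not part of the original answer) is to chain a step that drops it:
input[, .(site = first(site),
          start_time = min(start_time),
          end_time = max(end_time)), .(individual, rleid(site))][, rleid := NULL][]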

Assigning total to correct month from date range

I've got a data set with reservation data that has the below format:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the monthly total_paid amount divided by days and assigned to each month correctly for budget reasons.
While there's no issue for casa1, casa2 and casa3 have days reserved in two different months, and the totals get skewed because of this.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = seq_along(property),   # a few helper variables
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # one row for each day of stay (except the last)
  ungroup() %>%
  mutate(month = cut(date, breaks = "month")) %>% # determine the month of each date
  fill(property, check_in, check_out, total_paid, day_paid) %>%
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid))   # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable, but please ask if there is something I should explain. The convention is that people don't pay for the last day of a stay, so I took that into account.
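A quick sanity check on the casa2 rows: the stay runs 2018-01-30 to 2018-02-03, i.e. 4 paid nights, so day_paid = 110 / 4 = 27.5; two nights fall in January and two in February, giving 55 in each month, which is what the paid_month column shows above.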

Sum over two-week intervals

Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
  mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Using round_date in the way you have shown will not give you 14-day periods as you might expect. I have taken a different approach in this solution: generate a sequence of dates between your first and last dates, group these into 14-day periods, then join the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq = data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
  mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
  group_by(group) %>%
  mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
  right_join(dateseq) %>%
  group_by(starts) %>%
  summarise(sum_rain = sum(rain))
head(df.rain)
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
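To see how the bins line up: as.numeric(Dates - startdate) %/% 14 puts 2017-01-19 through 2017-02-01 into group 0, 2017-02-02 through 2017-02-15 into group 1, and so on, so starts is exactly the sequence of 14-day period openings beginning on the first observation date (2017-01-19, 2017-02-02, 2017-02-16, ...).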
The right join to the date sequence ensures that if there were missing observation days spanning a complete period, that period would still be listed in the result (though in your case you have a complete year of dates anyway).
round_date rounds to the nearest multiple of unit (here, 14 days) since some epoch (probably the Unix epoch of 1970-01-01 00:00:00), which doesn't line up with your purpose.
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
Assuming you want to round to the closest date among the ones you have specified, I guess the following will work:
library(plyr)   # for ddply() and .(); note that plyr masks some dplyr verbs if attached afterwards
targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates, function(x) which.min(abs(interval(targetDates, x))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates have the same number of observations. The date "2017-02-02", for instance, collects all the records from "2017-01-19" through "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, then why not just group by every 14 rows and sum.
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).

Using adply in data.table

I have a big data.table that looks like:
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
dt
start end id cat
1: 2012-07-13 23:45:00 2012-07-14 00:02:00 1 a
2: 2012-07-14 15:30:00 2012-07-14 15:35:00 2 b
3: 2012-07-14 23:57:00 2012-07-15 00:05:00 1 a
I need to get an output that shows total minutes of event on each calendar day by id and category. Using the example above the output should be:
day id cat V1
1: 13.07.2012 1 a 15
2: 14.07.2012 1 a 5
3: 14.07.2012 2 b 5
4: 15.07.2012 1 a 5
I used the adply function from the plyr package to split each duration into one-minute intervals:
library(plyr)   # for adply()
fn <- function(x){
  # s is a sequence of all minutes in the given interval
  s <- seq(from = as.POSIXct(x$start),
           to = as.POSIXct(x$end) - 1, by = "mins")
  # return a new data.table that contains each calendar minute for each id
  # and category of the original data
  df <- data.table(x$id, x$cat, s)
  df
}
# run the function above for each row in the data.table
dd <- adply(dt, 1, fn)
setDT(dd)   # adply() returns a data.frame; convert back so := works below
# extract the date from the calendar minutes
dd[, day := format(s, "%d.%m.%Y")]
# count the minutes of the event for each day, id and category
dd[, .N, by = c("day", "id", "cat")][order(day, id, cat)]
The solution above suits my needs perfectly, except for the time it takes to compute. When adply is run on very big data with several categories defined in the fn function, it feels like the CPU runs forever.
I will highly appreciate any hint on how to use pure data.table functionality in this problem.
I would suggest a few things:
Convert with as.POSIXct only once instead of once per row.
Instead of adply, which creates a whole data.table in each iteration, just use by within the data.table scope.
In order to do so, simply create a row index using .I.
Here's a quick attempt (I've used substr because it will probably be faster than as.Date or as.POSIXct. If you want it to be Date class again, use res[, Date := as.IDate(Date)] on the result instead of doing it by group).
dt[, `:=`(start = as.POSIXct(start), end = as.POSIXct(end), indx = .I)]
dt[, seq(start, end - 1L, by = "mins"), by = .(indx, id, cat)
][, .N, by = .(Date = substr(V1, 1L, 10L), id, cat)]
# Date id cat N
# 1: 2012-07-13 1 a 15
# 2: 2012-07-14 1 a 5
# 3: 2012-07-14 2 b 5
# 4: 2012-07-15 1 a 5
Try to see if this is faster.
It's still data.table in the background, but I'm using a dplyr syntax for the process.
library(data.table)
dt<-data.table(start=c("2012-07-13 23:45:00", "2012-07-14 15:30:00",
"2012-07-14 23:57:00"),
end=c("2012-07-14 00:02:00", "2012-07-14 15:35:00",
"2012-07-15 00:05:00"), id=c(1,2,1),cat=c("a","b","a"))
fn<-function(x){
s<-seq(from = as.POSIXct(x$start),
to = as.POSIXct(x$end)-1,by = "mins")
# here s is a sequence of all minutes in the given interval
df<-data.table(x$id,x$cat,s)
# return new data.table that contains each calendar minute for each id
# and category of the original data
df
}
library(dplyr)
dt %>%
rowwise() %>% # for each row
do(fn(.)) %>% # apply your function
select(day=s, id=V1, cat=V2) %>% # rename columns
mutate(day = substr(day,1,10)) %>% # keep only the day
ungroup %>%
group_by(day,id,cat) %>%
summarise(N=n()) %>%
ungroup
# Source: local data frame [4 x 4]
#
# day id cat N
# (chr) (dbl) (chr) (int)
# 1 2012-07-13 1 a 15
# 2 2012-07-14 1 a 5
# 3 2012-07-14 2 b 5
# 4 2012-07-15 1 a 5
