Identify events within a time window in R

I need to identify series of events (at most 3 events per series) that occurred within 60 seconds of each other.
Here is the input data:
IN<-read.table(header = FALSE, text = "
2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
and here is the desired output:
OUT<-read.table(header = FALSE, text = "
2018-06-01_04:29:47 1
2018-06-01_05:44:41 1
2018-06-01_05:44:43 2
2018-06-01_05:44:45 3
2018-06-01_05:57:54 1
2018-06-01_05:57:56 2
2018-06-01_05:57:58 3
2018-06-01_08:10:35 1
2018-06-01_08:41:20 1
2018-06-01_08:41:22 2
2018-06-01_08:41:24 3
2018-06-01_08:52:01 1
2018-06-01_09:02:13 1
2018-06-01_09:22:45 1
",quote="\n",col.names=c("time","response"))
I have searched for similar questions, but without success.
I guess that the function diff is the first step towards solving this problem,
response <- as.numeric(diff(IN$time) > 60)
but then I have no idea how to proceed to get the desired output.
Any help will be appreciated.
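For reference, one rough base R way to carry the diff idea through (a sketch only, checked just against the sample above: start a new series whenever the gap from the previous event exceeds 60 seconds or the current series already has 3 events):
# gap in seconds since the previous event; Inf for the first event so it always starts a series
gap <- c(Inf, as.numeric(diff(IN$time), units = "secs"))

response <- integer(nrow(IN))
for (i in seq_len(nrow(IN))) {
  if (gap[i] > 60 || response[i - 1] == 3) {
    response[i] <- 1                     # start a new series
  } else {
    response[i] <- response[i - 1] + 1   # continue the current series
  }
}
IN$response <- response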

Here's a solution using the dplyr, magrittr, and lubridate packages.
library(dplyr)
library(magrittr)
library(lubridate)

IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
I've removed the blank first line of the input text, as it caused problems. The following function filters the data frame to the events that fall within 60 seconds before the given ref_time and counts the number of rows using nrow.
event_count <- function(ref_time){
  IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow
}
Here, I apply the function in a row-wise fashion, record the counts, and sort by time (probably unnecessary). The results are piped back into the input data frame using the compound assignment pipe from magrittr.
IN %<>%
  rowwise() %>%
  mutate(counts = event_count(time)) %>%
  arrange(time)
Finally, the results.
# A tibble: 14 x 2
# time counts
# <dttm> <int>
# 1 2018-06-01 04:29:47 1
# 2 2018-06-01 05:44:41 1
# 3 2018-06-01 05:44:43 2
# 4 2018-06-01 05:44:45 3
# 5 2018-06-01 05:57:54 1
# 6 2018-06-01 05:57:56 2
# 7 2018-06-01 05:57:58 3
# 8 2018-06-01 08:10:35 1
# 9 2018-06-01 08:41:20 1
# 10 2018-06-01 08:41:22 2
# 11 2018-06-01 08:41:24 3
# 12 2018-06-01 08:52:01 1
# 13 2018-06-01 09:02:13 1
# 14 2018-06-01 09:22:45 1
I think what #PoGibas is alluding to is that, for some reason, there are two entries with the time 2018-06-01 05:57:54 in the input data frame. I'm not sure where the second one comes from...
EDIT: It's the new line in the read table that messes it up.
EDIT²: This returns a maximum of 3...
event_count <- function(ref_time){
  min(IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow, 3)
}
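The capped version slots into the same rowwise pipeline as before:
IN %<>%
  rowwise() %>%
  mutate(counts = event_count(time)) %>%
  arrange(time)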

Here's a data frame with some edge cases:
IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:44:47
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_05:58:56
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")
IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
IN
time
1 2018-06-01 04:29:47
2 2018-06-01 05:44:41
3 2018-06-01 05:44:43
4 2018-06-01 05:44:45
5 2018-06-01 05:44:47
6 2018-06-01 05:57:54
7 2018-06-01 05:57:56
8 2018-06-01 05:57:58
9 2018-06-01 05:58:56
10 2018-06-01 08:10:35
11 2018-06-01 08:41:20
12 2018-06-01 08:41:22
13 2018-06-01 08:41:24
14 2018-06-01 08:52:01
15 2018-06-01 09:02:13
16 2018-06-01 09:22:45
You'll notice row 9 is within a minute of the later members of the preceding group but more than a minute after that group's first (reference) time. Row 5 is also the 4th member of what would be a single group if no limit of 3 were imposed.
Here's my solution using dplyr. I think it works generally speaking:
res <- IN %>%
  mutate(diffs = as.numeric(difftime(time, lag(time), units = "secs")),  # gap to previous event, forced to seconds
         helper1 = case_when(is.na(diffs) ~ 1,
                             diffs <= 60 ~ 0,
                             TRUE ~ 1),
         grouper1 = cumsum(helper1)) %>%
  group_by(grouper1) %>%
  mutate(helper2 = cumsum(diffs) - first(diffs),
         helper3 = helper2 %/% 60,
         helper4 = helper1 + if_else(is.na(helper3), 0, helper3)) %>%
  ungroup() %>%
  mutate(grouper2 = cumsum(helper4)) %>%
  group_by(grouper2) %>%
  mutate(rn0 = row_number() - 1,
         grouper3 = rn0 %/% 3) %>%
  group_by(grouper2, grouper3) %>%
  mutate(count = row_number()) %>%
  ungroup() %>%
  select(time, count)
the result:
> res
# A tibble: 16 x 2
time count
<dttm> <int>
1 2018-06-01 04:29:47 1
2 2018-06-01 05:44:41 1
3 2018-06-01 05:44:43 2
4 2018-06-01 05:44:45 3
5 2018-06-01 05:44:47 1
6 2018-06-01 05:57:54 1
7 2018-06-01 05:57:56 2
8 2018-06-01 05:57:58 3
9 2018-06-01 05:58:56 1
10 2018-06-01 08:10:35 1
11 2018-06-01 08:41:20 1
12 2018-06-01 08:41:22 2
13 2018-06-01 08:41:24 3
14 2018-06-01 08:52:01 1
15 2018-06-01 09:02:13 1
16 2018-06-01 09:22:45 1
I think I structured the dplyr calls in a way that you can follow, but if you have questions feel free to ask in the comments.

Related

R create week numbers with specified start date

This seems like it should be straightforward, but I cannot find a way to do it.
I have a sales cycle that begins around August 1 of each year and I need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
The solution needs to let me specify the start date and count up week numbers in 7-day blocks from that start date. The actual start date doesn't always fall on a Sunday or Monday.
EG Data Frame
eg_data <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.POSIXct(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16", "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29")),
  week_n = c("1", "2", "1", "3", "1", "2", "1", "4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
library(dplyr)

secs_per_week <- 60*60*24*7
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + as.numeric(round((dates - min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
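One possible fix (just a sketch, not tested beyond this example) is to floor the elapsed weeks rather than round them, since an 11-day gap is still inside week 2:
library(dplyr)

eg_data %>%
  group_by(cycle) %>%
  # whole weeks elapsed since the cycle start; the start date itself is week 1
  mutate(nw = 1 + as.numeric(dates - min(dates), units = "days") %/% 7)
With integer division the 2017-08-14 row should come out as week 2, matching week_n.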
If someone has a better answer please post it, but this works.
Take the data frame in the example, eg_data:
eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.Date(dates) - cycle_start,
    week_n = days_diff / 7,
    week_n_whole = ceiling(days_diff / 7)) -> eg_data_check
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(dplyr)
library(lubridate)
library(stringr)  # for str_sub()

eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(dplyr)
library(lubridate)

cycle_starts <- data.frame(
  cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
  start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
  left_join(cycle_starts) %>%
  mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(dplyr)
library(lubridate)

eg_data %>%
  group_by(cycle) %>%
  mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4

Creating a Survival Analysis dataset

I have a table composed of three columns: ID, Opening Date and Cancellation Date.
What I want to do is to create 36 observations per client (one per month for 3 years) as a dummy variable. Basically, I want all the monthly observations before the cancellation date to have a 1 and the others a 0. In case the cancellation date is null, all of the values would be 1.
This process should be repeated for every ID.
The desired output would be a table with five columns: ID, Opening Date, Cancellation Date, Month (from 1 to 36, starting at the opening date) and Status (1 or 0).
I've tried everything but haven't managed to solve this problem, including using seq() to create and order the dates, e.g. seq(table$Opening, by = "month", length.out = 36), and many other ways.
We can use complete from tidyr to create a 1-month sequence of dates for each ID, use row_number within each group as the Month count, and create Status based on Cancellation_Date.
library(dplyr)
library(tidyr)
df %>%
  mutate_at(vars(ends_with("Date")), as.Date, "%d/%m/%y") %>%
  mutate(Date = Opening_Date) %>%
  group_by(ID) %>%
  complete(Date = seq(Date, by = "1 month", length.out = 36)) %>%
  mutate(Month = row_number()) %>%
  fill(Opening_Date, Cancellation_Date) %>%
  mutate(Status = +(Date <= Cancellation_Date))
# ID Date Opening_Date Cancellation_Date Month Status
# <dbl> <date> <date> <date> <int> <int>
# 1 336 2017-01-01 2017-01-01 2018-06-01 1 1
# 2 336 2017-02-01 2017-01-01 2018-06-01 2 1
# 3 336 2017-03-01 2017-01-01 2018-06-01 3 1
# 4 336 2017-04-01 2017-01-01 2018-06-01 4 1
# 5 336 2017-05-01 2017-01-01 2018-06-01 5 1
# 6 336 2017-06-01 2017-01-01 2018-06-01 6 1
# 7 336 2017-07-01 2017-01-01 2018-06-01 7 1
# 8 336 2017-08-01 2017-01-01 2018-06-01 8 1
# 9 336 2017-09-01 2017-01-01 2018-06-01 9 1
#10 336 2017-10-01 2017-01-01 2018-06-01 10 1
# … with 26 more rows
In the output, the Date column is the sequence of monthly dates for each ID; it can be removed from the final output if not needed.
data
df <- data.frame(ID = 336, Opening_Date = '1/1/17',Cancellation_Date = '1/6/18')
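The question also notes that a client with no cancellation date should keep Status = 1 for all 36 months. Assuming a missing cancellation date comes through as NA, the last step of the same pipeline could be adjusted along these lines (a sketch):
df %>%
  mutate_at(vars(ends_with("Date")), as.Date, "%d/%m/%y") %>%
  mutate(Date = Opening_Date) %>%
  group_by(ID) %>%
  complete(Date = seq(Date, by = "1 month", length.out = 36)) %>%
  mutate(Month = row_number()) %>%
  fill(Opening_Date, Cancellation_Date) %>%
  mutate(Status = +(is.na(Cancellation_Date) | Date <= Cancellation_Date))  # NA cancellation => always 1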

Aggregate date time to summarize time spent at certain 'repeating' conditions

Good day,
This is a continuation of this post.
Here are some dummy data:
Date <- as.POSIXct(c('2018-03-20 11:52:25', '2018-03-22 12:01:44', '2018-03-20 12:05:25', '2018-03-20 12:10:40', '2018-03-20 12:12:51 ', '2018-03-21 2:01:23', '2018-03-21 2:45:01', '2018-03-21 3:30:00', '2018-03-21 3:45:00', '2018-03-21 5:00:00', '2018-03-21 5:45:00'))
Sites<-c(4, 4, 4, 6, 6, 7, 7, 4, 4, 6, 6)
Individual<-c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A","A")
data<-data.frame(Individual, Date, Sites)
Individual Date Sites
A 2018-03-20 11:52:25 4
A 2018-03-22 12:01:44 4
A 2018-03-20 12:05:25 4
A 2018-03-20 12:10:40 6
A 2018-03-20 12:12:51 6
A 2018-03-21 02:01:23 7
A 2018-03-21 02:45:01 7
A 2018-03-21 03:30:00 4
A 2018-03-21 03:45:00 4
A 2018-03-21 05:00:00 6
A 2018-03-21 05:45:00 6
Basically, I would like R to tell me how much time is spent at each site. The data above have repeated visits to sites, and I would like R to tease out the repeat visits and add up the time differences for each.
I have tried the following code:
data.summary <- data %>%
  group_by(Individual, Sites) %>%
  summarise(time_spent = max(Date) - min(Date))
But this just takes the time difference between the minimum and maximum dates at that site, not accounting for repeat visits or the times the individual spends at other sites.
Digging further into the dummy data, the summarise code says that individual A spent 2 days at site 4. However, that individual left site 4 and re-entered it at a later date, and should have a total time at site 4 of 28 minutes. How can I get R to reflect repeat entries to that site?
Date1<-as.POSIXct("2018-03-20 11:52:25") # First instance at site 4
Date2<-as.POSIXct("2018-03-20 12:05:25") # Last time A spent at site 4 before leaving
difftime(Date2, Date1, units="mins")
# time diff = 13 minutes
# Second instance at site 4
Date3<-as.POSIXct("2018-03-21 03:30:00") # Second instance at site 4
Date4<-as.POSIXct("2018-03-21 03:45:00") # Last time A spent at site 4
difftime(Date4, Date3, units="mins")
# time diff= 15 mins
Thanks!
EDIT: I'm finding an issue with dplyr summarise, where extra time is being added. Here are dummy data:
Dates<-as.POSIXct(c("2018-04-09 16:59:03",
"2018-04-09 18:27:23",
"2018-04-09 17:01:20",
"2018-04-09 17:41:17"))
Individual<-c("A","A","A","A")
Site<-c("40","40","40", "40")
data<-data.frame(Dates, Individual, Site)
I want to summarise the time spent at site 40, with the minimum time stamp at the site subtracted from the maximum time stamp at the site:
data %>%
  group_by(Individual) %>%
  arrange(Dates) %>%
  group_by(Individual, Site) %>%
  summarise(time_spent = max(Dates) - min(Dates))
# A tibble: 1 x 3
# Groups: Individual [?]
Individual Site time_spent
<fct> <fct> <time>
1 A 40 1.472222 hours
This says the total time spent at this site is 1.47 hours. However, when I manually get a time difference I get an entirely different value.
maxtime<-("2018-04-09 17:41:17")
mintime<-("2018-04-09 16:59:03")
difftime(maxtime, mintime, units="hours")
# Time difference of 0.7038889 hours
The actual time at site 40 is 0.70 hours. I'm not quite sure what summarise is referencing, or why extra time is being added.
EDIT 2: Okay, this looks like a units issue! Here is more reproducible data:
Dates<-as.POSIXct(c("2018-04-09 16:43:44","2018-03-20 11:52:25", "2018-04-09 16:59:03",
"2018-04-09 18:27:23",
"2018-04-09 17:01:20",
"2018-04-09 17:41:17"))
Individual<-c("A","A","A","A", "A","A")
Site<-c("38","38", "40","40","40", "40")
data<-data.frame(Dates, Individual, Site)
Dates Individual Site
1 2018-04-09 16:43:44 A 38
2 2018-03-20 11:52:25 A 38
3 2018-04-09 16:59:03 A 40
4 2018-04-09 18:27:23 A 40
5 2018-04-09 17:01:20 A 40
6 2018-04-09 17:41:17 A 40
data %>%
  group_by(Individual) %>%
  arrange(Dates) %>%
  group_by(Individual, Site) %>%
  summarise(time_spent = max(Dates) - min(Dates))
# A tibble: 2 x 3
# Groups: Individual [?]
Individual Site time_spent
<fct> <fct> <time>
1 A 38 20.202303 days
2 A 40 1.472222 days
Here it says the time spent at site 40 is 1.47 days, but this should be hours, according to the manual time difference below:
maxtime<-("2018-04-09 18:27:23")
mintime<-("2018-04-09 16:59:03")
difftime(maxtime, mintime, units="hours")
# Time difference of 1.472222 hours
How can I correct this units issue? Instead of displaying hours intermixed with days, I would like R to calculate the time for all sites in days.
EDITED SOLUTION: After some trial and error, this is what ended up working. It uses a function from data.table, so you'll need to have that installed.
Step 1: create a unique run ID for each stretch of consecutive observations at the same site, ordered by Date.
library(dplyr)

data %>%
  arrange(Individuals, Dates) %>%
  mutate(rle_id = data.table::rleid(Sites))
Dates Individuals Sites rle_id
1 2018-03-20 11:52:25 A 38 1
2 2018-04-09 16:43:44 A 38 1
3 2018-04-09 16:59:03 A 40 2
4 2018-04-09 17:01:20 A 40 2
5 2018-04-09 17:41:17 A 40 2
6 2018-04-09 18:27:23 A 40 2
7 2018-03-20 11:52:25 B 4 3
8 2018-03-20 12:05:25 B 4 3
9 2018-03-20 12:10:40 B 6 4
10 2018-03-20 12:12:51 B 6 4
11 2018-03-21 02:01:23 B 7 5
12 2018-03-21 02:45:01 B 7 5
13 2018-03-21 03:30:00 B 4 6
14 2018-03-21 03:45:00 B 4 6
15 2018-03-21 05:00:00 B 6 7
16 2018-03-21 05:45:00 B 6 7
17 2018-03-22 12:01:44 B 4 8
You could get the rle_id using something in base R like what I have pasted below, but it is probably much slower (and harder to understand):
data <- data[order(data$Dates),]
rle_lengths <- rle(data$Sites)$lengths
unlist(Map(rep, 1:length(rle_lengths), rle_lengths))
[1] 1 2 2 3 3 4 4 5 5 6 6 7 8 9 9 9 9
vs.
data.table::rleid(data$Sites)
[1] 1 2 2 3 3 4 4 5 5 6 6 7 8 9 9 9 9
Step 2: get the time for individuals A and B at each site. If we did not specify the units in difftime, each group's difference would be calculated in its own automatic unit and then displayed under a single common unit; e.g., 1.5 hours can end up labelled as 1.5 days if someone is somewhere for several days.
data %>%
  arrange(Individuals, Dates) %>%
  mutate(rle_id = data.table::rleid(Sites)) %>%
  group_by(Individuals, rle_id, Sites) %>%
  summarise(time_spent = difftime(max(Dates), min(Dates), units = "days"))
# A tibble: 8 x 4
# Groups: Individuals, rle_id [8]
Individuals rle_id Sites time_spent
<fct> <int> <dbl> <time>
1 A 1 38 20.202303241 days
2 A 2 40 0.061342593 days
3 B 3 4 0.009027778 days
4 B 4 6 0.001516204 days
5 B 5 7 0.030300926 days
6 B 6 4 0.010416667 days
7 B 7 6 0.031250000 days
8 B 8 4 0.000000000 days
Step 3 (full solution): collapse across sites
data %>%
  arrange(Individuals, Dates) %>%
  mutate(rle_id = data.table::rleid(Sites)) %>%
  group_by(Individuals, rle_id, Sites) %>%
  summarise(time_spent = difftime(max(Dates), min(Dates), units = "days")) %>%
  group_by(Individuals, Sites) %>%
  summarise(time_spent_new = sum(time_spent))
# A tibble: 5 x 3
# Groups: Individuals [2]
Individuals Sites time_spent_new
<fct> <dbl> <time>
1 A 38 20.20230324 days
2 A 40 0.06134259 days
3 B 4 0.01944444 days
4 B 6 0.03276620 days
5 B 7 0.03030093 days
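If plain numeric columns are easier to work with downstream than a difftime, the summed result can be converted with as.numeric(); a small follow-up sketch, assuming the Step 3 pipeline above has been assigned to an object called site_times (a name used here only for illustration):
# site_times is assumed to hold the result of the Step 3 pipeline above
site_times %>%
  mutate(days = as.numeric(time_spent_new, units = "days"),
         mins = as.numeric(time_spent_new, units = "mins"))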
Data
Date <-as.POSIXct(c("2018-04-09 16:43:44","2018-03-20 11:52:25", "2018-04-09 16:59:03",
"2018-04-09 18:27:23","2018-04-09 17:01:20", "2018-04-09 17:41:17",
'2018-03-20 11:52:25', '2018-03-22 12:01:44', '2018-03-20 12:05:25',
'2018-03-20 12:10:40', '2018-03-20 12:12:51 ', '2018-03-21 2:01:23',
'2018-03-21 2:45:01', '2018-03-21 3:30:00', '2018-03-21 3:45:00',
'2018-03-21 5:00:00', '2018-03-21 5:45:00'))
Individual<-c(rep("A", 6), rep("B", 11))
Site<-c(38, 38, 40, 40, 40, 40, 4, 4, 4, 6, 6, 7, 7, 4, 4, 6, 6)
data<-data.frame(Dates = Date, Individuals = Individual, Sites = Site)

Corresponding last value in every minute

I want to extract the last value within every minute from, say, a table "Table":
Value Time
1 5/1/2018 15:50:57
5 5/1/2018 15:50:58
21 5/1/2018 15:51:48
22 5/1/2018 15:51:49
5 5/1/2018 15:52:58
8 5/1/2018 15:52:59
71 5/1/2018 15:53:45
33 5/1/2018 15:53:50
I need the corresponding last "Value" at the end of each minute in "Time". That is, I want the output values to be: 5, 22, 8, 33.
I tried using as.POSIXct on Table$Time but I am not able to proceed.
1) aggregate: Using DF, shown reproducibly in the Note at the end, truncate each time to the minute and then aggregate based on that:
aggregate(Value ~ Minute, transform(DF, Minute = trunc(Time, "min")), tail, 1)
giving:
Minute Value
1 2018-05-01 15:59:00 5
2 2018-05-01 16:59:00 22
3 2018-05-01 17:59:00 8
4 2018-05-01 18:59:00 33
2) subset: An alternative, depending on what output you want, is to truncate the times to minutes and then remove the rows with duplicated truncated times, proceeding backwards from the end:
subset(DF, !duplicated(trunc(Time, "min"), fromLast = TRUE))
giving:
Value Time
2 5 2018-05-01 15:59:58
4 22 2018-05-01 16:59:49
6 8 2018-05-01 17:59:59
8 33 2018-05-01 18:59:50
Note
We assume the following input shown reproducibly. Note that we have converted the Time column to POSIXct class.
Lines <- "
Value Time
1 5/1/2018 15:59:57
5 5/1/2018 15:59:58
21 5/1/2018 16:59:48
22 5/1/2018 16:59:49
5 5/1/2018 17:59:58
8 5/1/2018 17:59:59
71 5/1/2018 18:59:45
33 5/1/2018 18:59:50"
Lines2 <- sub(" ", ",", trimws(readLines(textConnection(Lines))))
DF <- read.csv(text = Lines2)
DF$Time <- as.POSIXct(DF$Time, format = "%m/%d/%Y %H:%M:%S")
Very similar to #G.Grothendieck's answer, but using format instead, i.e.
aggregate(Value ~ format(Time, '%Y-%m-%d %H:%M:00'), df, tail, 1)
# format(Time, "%Y-%m-%d %H:%M:00") Value
#1 2018-05-01 15:50:00 5
#2 2018-05-01 15:51:00 22
#3 2018-05-01 15:52:00 8
#4 2018-05-01 15:53:00 33
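A small variation, if the auto-generated column name is awkward: create the truncated-minute column first and aggregate on it (assuming df is the question's table with a POSIXct Time column, as above):
# same idea, but with a readable grouping column name
aggregate(Value ~ Minute,
          transform(df, Minute = format(Time, "%Y-%m-%d %H:%M:00")),
          tail, 1)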
Building on #G.Grothendieck's great answer, I provide a tidyverse solution.
library(dplyr)
Lines <- "
Value Time
1 5/1/2018 15:50:57
5 5/1/2018 15:50:58
21 5/1/2018 16:51:48
22 5/1/2018 16:51:49
5 5/1/2018 17:52:58
8 5/1/2018 17:52:59
71 5/1/2018 18:53:45
33 5/1/2018 18:53:50"
Lines2 <- sub(" ", ",", readLines(textConnection(Lines)))
DF <- read.csv(text = Lines2) %>% tibble::as_tibble()
# After creating the reproducible data set, set Time to date-time format
# and then floor the time to the nearest minute
DF %>%
  dplyr::mutate(Time = lubridate::dmy_hms(Time),
                minute = lubridate::floor_date(Time, "minute")) %>%
  # group by minute
  dplyr::group_by(minute) %>%
  # arrange by time
  dplyr::arrange(Time) %>%
  # extract the last row in each group
  dplyr::filter(dplyr::row_number() == n())
Output
# A tibble: 4 x 3
# Groups: minute [4]
Value Time minute
<int> <dttm> <dttm>
1 5 2018-01-05 15:50:58 2018-01-05 15:50:00
2 22 2018-01-05 16:51:49 2018-01-05 16:51:00
3 8 2018-01-05 17:52:59 2018-01-05 17:52:00
4 33 2018-01-05 18:53:50 2018-01-05 18:53:00
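As a side note, with more recent dplyr (>= 1.0) the filter(row_number() == n()) step can be written more directly with slice_tail(); a sketch of the same pipeline:
DF %>%
  dplyr::mutate(Time = lubridate::dmy_hms(Time),
                minute = lubridate::floor_date(Time, "minute")) %>%
  dplyr::arrange(Time) %>%
  dplyr::group_by(minute) %>%
  dplyr::slice_tail(n = 1)   # keep the last row in each minute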

Count how many cases exist per week given start and end dates of each case [closed]

I'm new here, so I apologize if I miss any conventions.
I have a ~2000-row dataset with data on unique cases happening in a three-year period. Each case has a start date and an end date. I want to be able to get a new data frame that shows how many cases occur per week in this three-year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem can be solved more easily with the sqldf package, but I thought I would stick with dplyr.
The approach:
library(dplyr)
library(lubridate)
# First create a data frame holding all weeks from the chosen start date to the end date,
# 2015-01-01 to 2017-12-31
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
                                      as.Date("2017-12-31"), by = 7))
df_week <- df_week %>%
  mutate(weekEnd = weekStart + 7,
         weekNum = as.character(weekStart, "%V-%Y"),
         dummy = TRUE)
# The dummy column is only for joining purpose.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter and then group on week to find number of cases
# in each week
df_week %>%
  left_join(df, by = "dummy") %>%
  select(-dummy) %>%
  filter((weekStart >= Start_Date & weekStart <= End_Date) |
           (weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
  group_by(weekStart, weekEnd, weekNum) %>%
  summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
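As an aside, the dummy column exists only to force a cross join; the same thing can be done with tidyr::crossing() if you drop the helper column first (a sketch under that assumption):
# cross join without the dummy column, then filter and count exactly as above
tidyr::crossing(select(df_week, -dummy), select(df, -dummy)) %>%
  filter((weekStart >= Start_Date & weekStart <= End_Date) |
           (weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
  group_by(weekStart, weekEnd, weekNum) %>%
  summarise(cases = n())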
Welcome to SO!
Before solving the problem, be sure to have the required packages installed; run
install.packages(c("readr", "tidyr", "dplyr", "lubridate"))
if you haven't installed those packages yet.
I'll present a modern R solution next; those packages are magic.
This is a way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_delim(raw_data, delim = "\t") %>%
  mutate(start_date = as.Date(start_date)) %>%   # convert column 2 to Date format, assuming the date is yyyy-mm-dd
  mutate(weeks_lapse = as.integer((start_date - min(start_date))/dweeks(1)))   # weeks passed since the earliest date in the data

curated_data %>%
  group_by(weeks_lapse) %>%          # group by week
  summarise(cases_per_week = n())    # count cases per week
And the solution is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3
