Calculate time difference between two events while disregarding unmatched events - r

I have a data set with a structure such as this:
structure(list(id = c(43956L, 46640L, 71548L, 71548L, 71548L,
72029L, 72029L, 74558L, 74558L, 100596L, 100596L, 100596L, 104630L,
104630L, 104630L, 104630L, 104630L, 104630L, 104630L, 104630L
), event = c("LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN",
"LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGIN",
"LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT"
), timestamp = c("2017-03-27 09:19:29", "2016-06-10 00:09:08",
"2016-01-27 12:00:25", "2016-06-20 11:34:29", "2016-06-20 11:35:44",
"2016-12-28 10:43:25", "2016-12-28 10:56:30", "2016-10-15 15:08:39",
"2016-10-15 15:10:06", "2016-03-09 14:30:48", "2016-03-09 14:31:10",
"2017-04-03 10:36:54", "2016-01-11 16:52:08", "2016-02-03 14:40:32",
"2016-03-30 12:34:56", "2016-05-26 13:14:25", "2016-08-22 15:20:02",
"2016-08-22 15:21:53", "2016-08-22 15:22:23", "2016-08-22 15:23:08"
)), .Names = c("id", "event", "timestamp"), row.names = c(5447L,
5446L, 5443L, 5444L, 5445L, 5441L, 5442L, 5439L, 5440L, 5436L,
5437L, 5438L, 5425L, 5426L, 5427L, 5428L, 5429L, 5430L, 5431L,
5432L), class = "data.frame")
id event timestamp
5447 43956 LOGIN 2017-03-27 09:19:29
5446 46640 LOGIN 2016-06-10 00:09:08
5443 71548 LOGIN 2016-01-27 12:00:25
5444 71548 LOGIN 2016-06-20 11:34:29
5445 71548 LOGOUT 2016-06-20 11:35:44
5441 72029 LOGIN 2016-12-28 10:43:25
5442 72029 LOGOUT 2016-12-28 10:56:30
5439 74558 LOGIN 2016-10-15 15:08:39
5440 74558 LOGOUT 2016-10-15 15:10:06
5436 100596 LOGIN 2016-03-09 14:30:48
5437 100596 LOGOUT 2016-03-09 14:31:10
5438 100596 LOGIN 2017-04-03 10:36:54
5425 104630 LOGIN 2016-01-11 16:52:08
5426 104630 LOGIN 2016-02-03 14:40:32
5427 104630 LOGIN 2016-03-30 12:34:56
5428 104630 LOGIN 2016-05-26 13:14:25
5429 104630 LOGIN 2016-08-22 15:20:02
5430 104630 LOGOUT 2016-08-22 15:21:53
5431 104630 LOGIN 2016-08-22 15:22:23
5432 104630 LOGOUT 2016-08-22 15:23:08
I wish to calculate the time difference between LOGIN and LOGOUT (session duration) as well as between LOGOUT and LOGIN (session interval). Unfortunately, I have LOGIN events that do not have a matching LOGOUT event.
The correct LOGOUT event always follows its corresponding LOGIN event (I ordered the data frame by id and timestamp). I tried adapting this answer, but had no luck. I also tried creating an event identifier, but since I can't find a way to get the numbering of the LOGOUT events to match the numbering of the LOGIN events, I am unsure how useful such an identifier would be:
df$eventNum <- as.numeric(ave(as.character(df$id), df$id, as.character(df$event), FUN = seq_along))

Here's the approach I'd take:
First, I'd convert the event variable to an ordered factor, because it makes sense to think of its values this way (i.e. Login < Logout, in terms of order), and because it will enable easier comparison between rows:
df$event <- factor(df$event, levels = c("LOGIN", "LOGOUT"), ordered = TRUE)
Then, assuming that timestamp has been parsed into a proper date-time, which something like this would provide:
df$timestamp <- lubridate::parse_date_time(df$timestamp, "%Y-%m-%d %H:%M:%S")
You can conditionally mutate your data.frame by grouping by ID and then calling mutate with ifelse functions:
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(
    timeElapsed = ifelse(event != lag(event), lubridate::seconds_to_period(timestamp - lag(timestamp)), NA),
    eventType = ifelse(event > lag(event), 'Duration', ifelse(event < lag(event), 'Interval', NA))
  )
# id event timestamp timeElapsed eventType
# <int> <ord> <dttm> <dbl> <chr>
# 1 43956 LOGIN 2017-03-27 09:19:29 NA <NA>
# 2 46640 LOGIN 2016-06-10 00:09:08 NA <NA>
# 3 71548 LOGIN 2016-01-27 12:00:25 NA <NA>
# 4 71548 LOGIN 2016-06-20 11:34:29 NA <NA>
# 5 71548 LOGOUT 2016-06-20 11:35:44 1.25000 Duration
# 6 72029 LOGIN 2016-12-28 10:43:25 NA <NA>
# 7 72029 LOGOUT 2016-12-28 10:56:30 13.08333 Duration
# 8 74558 LOGIN 2016-10-15 15:08:39 NA <NA>
# 9 74558 LOGOUT 2016-10-15 15:10:06 1.45000 Duration
# 10 100596 LOGIN 2016-03-09 14:30:48 NA <NA>
# 11 100596 LOGOUT 2016-03-09 14:31:10 22.00000 Duration
# 12 100596 LOGIN 2017-04-03 10:36:54 44.00000 Interval
# 13 104630 LOGIN 2016-01-11 16:52:08 NA <NA>
# 14 104630 LOGIN 2016-02-03 14:40:32 NA <NA>
# 15 104630 LOGIN 2016-03-30 12:34:56 NA <NA>
# 16 104630 LOGIN 2016-05-26 13:14:25 NA <NA>
# 17 104630 LOGIN 2016-08-22 15:20:02 NA <NA>
# 18 104630 LOGOUT 2016-08-22 15:21:53 51.00000 Duration
# 19 104630 LOGIN 2016-08-22 15:22:23 30.00000 Interval
# 20 104630 LOGOUT 2016-08-22 15:23:08 45.00000 Duration
Using lubridate::seconds_to_period will give you the time difference in "%d %H %M %S" format.
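For reference, here is a small sketch of what seconds_to_period() does on its own when given a plain number of seconds:
library(lubridate)

seconds_to_period(75)
#> [1] "1M 15S"
seconds_to_period(208715)
#> [1] "2d 9H 58M 35S"
Note that wrapping the result in ifelse(), as in the mutate() call above, drops the Period class and returns the underlying number, which is why timeElapsed prints as a plain <dbl> column.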

Assuming that any user stays logged in until they log out, it seems the data can be ordered so that a simple lag calculation will do the trick.
Using the dplyr library, and assuming that your data frame is called "df" and that you have already converted timestamp to a date-time class such as POSIXct:
library(dplyr)

df %>%
  arrange(id, timestamp) %>%
  group_by(id, event) %>%
  mutate(rank = dense_rank(timestamp)) %>%
  ungroup() %>%
  arrange(id, rank, timestamp) %>%
  group_by(id) %>%
  mutate(duration = ifelse(event == "LOGOUT", timestamp - lag(timestamp), NA))
Line by line:
First, we order the data by id and timestamp, and group by id and event to assign a rank to the LOGIN and LOGOUT events. The first LOGIN for a given user gets rank 1, and the first LOGOUT for that user also gets rank 1.
df %>%
  arrange(id, timestamp) %>%
  group_by(id, event) %>%
  mutate(rank = dense_rank(timestamp))
Then we remove the grouping and sort again by id, rank and timestamp. This yields a data frame in the right order, with each LOGIN event followed by its corresponding LOGOUT event for every user, so we can apply a lag calculation.
  ungroup() %>%
  arrange(id, rank, timestamp) %>%
Finally, we group again by "id" and we use mutate to calculate the lag of the timestamps only for the LOGOUT events.
  group_by(id) %>%
  mutate(duration = ifelse(event == "LOGOUT", timestamp - lag(timestamp), NA))
That should yield a dataframe such as:
id event timestamp rank duration
<int> <chr> <dttm> <int> <dbl>
1 43956 LOGIN 2017-03-27 09:19:29 1 NA
2 46640 LOGIN 2016-06-10 00:09:08 1 NA
3 71548 LOGIN 2016-01-27 12:00:25 1 NA
4 71548 LOGOUT 2016-06-20 11:35:44 1 208715.31667
5 71548 LOGIN 2016-06-20 11:34:29 2 NA
6 72029 LOGIN 2016-12-28 10:43:25 1 NA
7 72029 LOGOUT 2016-12-28 10:56:30 1 13.08333
8 74558 LOGIN 2016-10-15 15:08:39 1 NA
9 74558 LOGOUT 2016-10-15 15:10:06 1 1.45000
10 100596 LOGIN 2016-03-09 14:30:48 1 NA
11 100596 LOGOUT 2016-03-09 14:31:10 1 22.00000
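If instead you want every LOGOUT paired with the most recent preceding LOGIN, and unmatched LOGINs simply left as NA, one possible sketch (assuming the data are already sorted by id and timestamp, and that timestamp is POSIXct) uses a session counter built with cumsum():
library(dplyr)

df %>%
  group_by(id) %>%
  # each LOGIN opens a new session; its LOGOUT (if any) falls into the same session
  mutate(session = cumsum(event == "LOGIN")) %>%
  group_by(id, session) %>%
  mutate(duration_mins = ifelse(event == "LOGOUT",
                                as.numeric(difftime(timestamp, lag(timestamp), units = "mins")),
                                NA_real_)) %>%
  ungroup()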

Related

Calculate time of session from logs in R

I have a nice dataset that includes user logins to my website, but no user logouts. A user can come back the next day, and then another row is recorded. I'm looking to calculate how much time each user spent on the site.
The working assumption is that the site is quite interactive, so the time between a user's first and last action can be taken as the time the user was on the site. The problem is defining the final action: for example, a user could also sit on the same page all night without taking any action, but this is unlikely. It is very clear to me how I would calculate this by hand, but I haven't been able to find the proper code.
library(tidyverse)
df <- read_csv(url("https://srv-file9.gofile.io/download/eBInE9/sheet1.csv"))
df %>% group_by(`User ID`)
lag(df$Time, 1)
## Now I am wondering how to cluster the items by time and to calculate lags.
Any thoughts?
Thanks!
I am assuming that the csv file you read has a POSIXct column which contains the login time. If not, you should make sure this exists.
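If that column comes in as plain text, something along these lines would convert it (assuming, as in your snippet, the column is called Time and is in year-month-day hour:minute:second form):
library(dplyr)
library(lubridate)

df <- df %>% mutate(Time = ymd_hms(Time))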
Here's some code which generates time differences using the lag function by ID. The first time difference for each group will be NA. I generate some random time data first as you have not provided sample data (as you should have).
library(dplyr)
library(lubridate)
library(tibble)
random_times <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/02/01'), by="1 mins")
login <- tibble(login = sort(sample(random_times[hour(random_times) > "00:00" & hour(random_times) < "23:59"], 1000)),
                ID = sample(LETTERS[1:6], 1000, replace = TRUE))

login <- login %>%
  group_by(ID) %>%
  mutate(login_delayed = lag(login, 1)) %>%
  mutate(login_time = login - login_delayed)
With this output:
> login
# A tibble: 1,000 x 4
# Groups: ID [6]
login ID login_delayed login_time
<dttm> <chr> <dttm> <drtn>
1 2020-01-01 01:03:00 A NA NA mins
2 2020-01-01 01:11:00 A 2020-01-01 01:03:00 8 mins
3 2020-01-01 01:46:00 E NA NA mins
4 2020-01-01 02:33:00 E 2020-01-01 01:46:00 47 mins
5 2020-01-01 02:47:00 A 2020-01-01 01:11:00 96 mins
6 2020-01-01 10:43:00 F NA NA mins
7 2020-01-01 11:44:00 A 2020-01-01 02:47:00 537 mins
8 2020-01-01 11:57:00 A 2020-01-01 11:44:00 13 mins
9 2020-01-01 12:02:00 F 2020-01-01 10:43:00 79 mins
10 2020-01-01 12:57:00 D NA NA mins
# ... with 990 more rows
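To go from these per-login gaps to the sessions the question asks about, one possible sketch is to start a new session whenever the gap exceeds some inactivity cutoff (30 minutes here, which is an arbitrary assumption) and then take the span of each session:
login %>%
  group_by(ID) %>%
  arrange(login, .by_group = TRUE) %>%
  # a gap of more than 30 minutes (an arbitrary cutoff) starts a new session
  mutate(new_session = is.na(login_time) | as.numeric(login_time, units = "mins") > 30,
         session = cumsum(new_session)) %>%
  group_by(ID, session) %>%
  summarise(session_start = min(login),
            session_end   = max(login),
            time_on_site  = difftime(max(login), min(login), units = "mins"),
            .groups = "drop")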

Can't access Custom Dimension in Google Analytics API via R

I'm attempting to create reports in R using the googleAnalyticsR package. We have 4 custom dimensions defined in Google Analytics, and when I try to access 2 of them, no data is retrieved.
In the code below, when I use Dimension1 or Dimension2, everything runs smoothly, but if I try Dimension3 or Dimension4, no data is retrieved:
library(googleAnalyticsR)
sampled_data_fetch2 <- google_analytics(ga_id,
                                        date_range = c("2018-01-01", "2018-02-28"),
                                        metrics = c("sessions"),
                                        dimensions = c("date", "dimension2"))
2019-09-09 17:58:16> Downloaded [1000] rows from a total of [8079].

sampled_data_fetch2 <- google_analytics(ga_id,
                                        date_range = c("2018-01-01", "2018-02-28"),
                                        metrics = c("sessions"),
                                        dimensions = c("date", "dimension4"))
2019-09-09 17:59:16> Downloaded [0] rows from a total of [].
I suspect that Dimension3 and Dimension4 are not session-scoped custom dimensions (hence the sessions metric returns no results). The output of ga_custom_vars_list should confirm this (specifically the scope column):
cds <- ga_custom_vars_list(
accountId = account_details$accountId,
webPropertyId = account_details$webPropertyId
)
head(cds)
Output:
id accountId webPropertyId name index
1 ga:dimension1 418XXXXX UA-XXXXX-5 CD1 NAME (1) 1
2 ga:dimension2 418XXXXX UA-XXXXX-5 CD2 NAME (2) 2
3 ga:dimension3 418XXXXX UA-XXXXX-5 CD3 NAME (3) 3
4 ga:dimension4 418XXXXX UA-XXXXX-5 CD4 NAME (4) 4
5 ga:dimension5 418XXXXX UA-XXXXX-5 CD5 NAME (5) 5
6 ga:dimension6 418XXXXX UA-XXXXX-5 CD6 NAME (6) 6
scope active created updated
1 SESSION TRUE 2014-02-18 18:42:23 2017-05-25 16:34:20
2 SESSION TRUE 2015-09-17 21:11:19 2017-05-25 16:34:29
3 HIT TRUE 2016-06-01 15:12:18 2016-06-01 15:12:18
4 HIT TRUE 2016-06-01 15:12:27 2017-05-25 16:36:24
5 HIT TRUE 2016-06-01 15:12:42 2017-05-25 16:36:29
6 HIT TRUE 2016-06-02 11:27:14 2016-06-02 11:27:14
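If the scope column does show HIT for dimension3 and dimension4, one thing worth trying (a sketch, not verified against your property) is pairing those dimensions with a hit-level metric such as pageviews instead of sessions:
# hit_fetch is just an illustrative name
hit_fetch <- google_analytics(ga_id,
                              date_range = c("2018-01-01", "2018-02-28"),
                              metrics = c("pageviews"),
                              dimensions = c("date", "dimension4"))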

R: data.table: aggregation using referencing over time

I have a dataset with periods
library(data.table)

active <- data.table(id  = c(1, 1, 2, 3),
                     beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
                     end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00")))
> active
id beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2018-01-01 01:50:00 2018-01-01 02:00:00
3: 2 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine for every point in
time <- data.table(time = seq(from = min(active$beg), to = max(active$end), by = "mins"))
the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like
>ans
time inactive av.time
1: 2018-01-01 01:10:00 2 30
2: 2018-01-01 01:11:00 2 29
...
50: 2018-01-01 02:00:00 0 0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.
Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)
time %>%
  mutate(dummy = TRUE) %>%
  inner_join({
    active %>%
      mutate(dummy = TRUE)
    # join by the dummy variable to get the Cartesian product
  }, by = c("dummy" = "dummy")) %>%
  select(-dummy) %>%
  # define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>%
  # group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
time inactive av.time
<dttm> <int> <dbl>
1 2018-01-01 01:10:00 3 40
2 2018-01-01 01:11:00 3 39
3 2018-01-01 01:12:00 3 38
4 2018-01-01 01:13:00 3 37
5 2018-01-01 01:14:00 3 36
6 2018-01-01 01:15:00 3 35
7 2018-01-01 01:16:00 3 34
8 2018-01-01 01:17:00 3 33
9 2018-01-01 01:18:00 3 32
10 2018-01-01 01:19:00 3 31
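Since the question asks about data.table, here is a rough sketch of the same cross-join-and-aggregate idea in data.table syntax (the names tt, grid and ans are just illustrative; at the final minute, where nothing is inactive, av.time comes out as NaN rather than the 0 shown in the desired output):
library(data.table)

tt <- seq(from = min(active$beg), to = max(active$end), by = "mins")

# Cartesian product: every minute paired with every id's activity period
grid <- active[rep(seq_len(nrow(active)), each = length(tt))]
grid[, time := rep(tt, times = nrow(active))]

ans <- grid[, .(inactive = sum(time < beg | time > end),
                av.time  = mean(ifelse(beg > time,
                                       as.numeric(difftime(beg, time, units = "mins")),
                                       NA_real_),
                                na.rm = TRUE)),
            by = time]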

How can I use mutate to create a new column based only on a subset of other rows of a data frame?

I was agonizing over how to phrase my question. I have a data frame of accounts, and I want to create a new column that flags whether there is another account with a duplicate email within 30 days of that account's date.
I have a table like this.
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John@gmail.com", "John@gmail.com", "Alex@outlook.com", "Alan@yahoo.com", "Stan@aol.com", "Mary@outlook.com", "Adam@outlook.com", "Tom@aol.com", "Jane@yahoo.com", "John@gmail.com")
Dates <- c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
           "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10")
df <- data.frame(AccountNumbers, EmailAddress, Dates)
print(df)
print(df)
AccountNumbers EmailAddress Dates
3748 John@gmail.com 2018-05-01
8894 John@gmail.com 2018-05-05
9923 Alex@outlook.com 2018-05-10
4502 Alan@yahoo.com 2018-05-15
7283 Stan@aol.com 2018-05-20
8012 Mary@outlook.com 2018-05-25
2938 Adam@outlook.com 2018-05-30
7485 Tom@aol.com 2018-06-01
1010 Jane@yahoo.com 2018-06-05
9877 John@gmail.com 2018-06-10
John@gmail.com appears three times. I want to flag the first two rows because they appear within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John@gmail.com 2018-05-01 1
8894 John@gmail.com 2018-05-05 1
9923 Alex@outlook.com 2018-05-10 0
4502 Alan@yahoo.com 2018-05-15 0
7283 Stan@aol.com 2018-05-20 0
8012 Mary@outlook.com 2018-05-25 0
2938 Adam@outlook.com 2018-05-30 0
7485 Tom@aol.com 2018-06-01 0
1010 Jane@yahoo.com 2018-06-05 0
9877 John@gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John@gmail.com", "John@gmail.com", "Alex@outlook.com", "Alan@yahoo.com", "Stan@aol.com", "Mary@outlook.com", "Adam@outlook.com", "Tom@aol.com", "Jane@yahoo.com", "John@gmail.com")
Dates <- c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
           "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
library(dplyr)
library(tidyr)

df %>%
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>%
  mutate(dupe = replace_na(dupe, 0)) %>%
  ungroup
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John@gmail.com 2018-05-01 1
# 2 8894 John@gmail.com 2018-05-05 1
# 3 9923 Alex@outlook.com 2018-05-10 0
# 4 4502 Alan@yahoo.com 2018-05-15 0
# 5 7283 Stan@aol.com 2018-05-20 0
# 6 8012 Mary@outlook.com 2018-05-25 0
# 7 2938 Adam@outlook.com 2018-05-30 0
# 8 7485 Tom@aol.com 2018-06-01 0
# 9 1010 Jane@yahoo.com 2018-06-05 0
# 10 9877 John@gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
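That extra step could be as small as the following (a minimal sketch):
df <- df %>% arrange(email, date)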
I think this gets at what you want:
library(dplyr)

df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2) * 1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John@gmail.com 2018-05-01 1
2 8894 John@gmail.com 2018-05-05 1
3 9923 Alex@outlook.com 2018-05-10 0
4 4502 Alan@yahoo.com 2018-05-15 0
5 7283 Stan@aol.com 2018-05-20 0
6 8012 Mary@outlook.com 2018-05-25 0
7 2938 Adam@outlook.com 2018-05-30 0
8 7485 Tom@aol.com 2018-06-01 0
9 1010 Jane@yahoo.com 2018-06-05 0
10 9877 John@gmail.com 2018-06-10 0
Note:
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2).
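To require at least three chained entries, only the flag condition changes; a sketch (TripleEmailFlag is just an illustrative name):
df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(TripleEmailFlag = (n() >= 3) * 1) %>%  # at least 3 entries in the 30-day chain
  ungroup() %>%
  select(-helper)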
Slightly modified data:
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John@gmail.com", "John@gmail.com", "Alex@outlook.com", "Alan@yahoo.com", "Stan@aol.com", "Mary@outlook.com", "Adam@outlook.com", "Tom@aol.com", "Jane@yahoo.com", "John@gmail.com")
Dates <- as.Date(c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
                   "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10"))
df <- data.frame(AccountNumbers, EmailAddress, Dates, stringsAsFactors = FALSE)

Count how many cases exist per week given start and end dates of each case [closed]

I'm new here, so I apologize if I miss any conventions.
I have a ~2000 row dataset with data on unique cases happening in a three year period. Each case has a start date and an end date. I want to be able to get a new dataframe that shows how many cases occur per week in this three year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem can be solved more easily with the sqldf package, but I thought I'd stick with dplyr.
The approach:
library(dplyr)
library(lubridate)

# First create a data frame holding all weeks from the chosen start date to the end date,
# here 2015-01-01 to 2017-12-31.
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
                                      as.Date("2017-12-31"), by = 7))
df_week <- df_week %>%
  mutate(weekEnd = weekStart + 7,
         weekNum = as.character(weekStart, "%V-%Y"),
         dummy = TRUE)
# The dummy column is only for joining purpose.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter and then group on week to find number of cases
# in each week
df_week %>%
  left_join(df, by = "dummy") %>%
  select(-dummy) %>%
  filter((weekStart >= Start_Date & weekStart <= End_Date) |
           (weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
  group_by(weekStart, weekEnd, weekNum) %>%
  summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
Welcome to SO!
Before solving the problem, make sure the required packages are installed by running
install.packages(c("readr", "dplyr", "lubridate"))
if you haven't installed them yet.
I'll present a modern R solution next; these packages do the heavy lifting.
This is a way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_delim(raw_data, delim = "\t") %>%
  mutate(start_date = as.Date(start_date)) %>%  # convert the start_date column to Date format, assuming yyyy-mm-dd
  mutate(weeks_lapse = as.integer((start_date - min(start_date)) / dweeks(1)))  # whole weeks since the earliest date in the data

curated_data %>%
  group_by(weeks_lapse) %>%          # group by the week offset
  summarise(cases_per_week = n())    # count the cases in each week
And the solution is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3
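If you also want a human-readable week label rather than just the offset, one small sketch (week_start is just an illustrative column name, and lubridate is assumed to be loaded):
curated_data %>%
  group_by(weeks_lapse) %>%
  summarise(cases_per_week = n()) %>%
  # label each bucket with the calendar date its week starts on
  mutate(week_start = min(curated_data$start_date) + weeks(weeks_lapse))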
