R: aggregate per-second data to minutes more efficiently

I have a data.table, allData, containing data on roughly every (POSIXct) second from different nights. Some nights, however, fall on the same date, since the data is collected from different people, so I have a column nightNo as an id for each distinct night.
timestamp            nightNo data1 data2
2018-10-19 19:15:00        1     1     7
2018-10-19 19:15:01        1     2     8
2018-10-19 19:15:02        1     3     9
2018-10-19 18:10:22        2     4    10
2018-10-19 18:10:23        2     5    11
2018-10-19 18:10:24        2     6    12
I'd like to aggregate the data to minutes (per night), and based on this question I've come up with the following code:
aggregate_minute <- function(df){
  df %>%
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}
allData <- allData[, aggregate_minute(allData), by=nightNo]
However, my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?

allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
                      nightNo = rep(1:2, c(3, 3)),
                      data1 = 1:6,
                      data2 = 7:12)
             timestamp nightNo data1 data2
1: 2018-06-14 10:43:11       1     1     7
2: 2018-06-14 10:43:11       1     2     8
3: 2018-06-14 10:43:11       1     3     9
4: 2018-06-14 10:48:31       2     4    10
5: 2018-06-14 10:48:31       2     5    11
6: 2018-06-14 10:48:31       2     6    12
You can skip the per-group function call and do everything in a single data.table expression, grouping by both nightNo and the minute:
allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks = "1 min"))]
nightNo timestamp data1 data2
1: 1 2018-06-14 10:43:00 2 8
2: 2 2018-06-14 10:48:00 5 11
> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
user system elapsed
3.25 0.02 3.31
> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
user system elapsed
1.02 0.04 1.06
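If this still isn't fast enough, a good share of the remaining cost is often cut() building a factor for every group. A hedged sketch that truncates the timestamps arithmetically instead and aggregates via lapply(.SD, mean), so data.table's optimized group means can kick in (the 60-second integer arithmetic is my assumption, not part of the answer above):
library(data.table)
# truncate each POSIXct to the start of its minute via integer division,
# avoiding the factor that cut() builds; aggregates all value columns at once
allData[, lapply(.SD, mean),
        .SDcols = c("data1", "data2"),
        by = .(nightNo,
               timestamp = as.POSIXct(60 * (as.numeric(timestamp) %/% 60),
                                      origin = "1970-01-01"))]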

You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.
library(data.table)
library(lubridate)
Reproducible data:
text <- "timestamp nightNo data1 data2
'2018-10-19 19:15:00' 1 1 7
'2018-10-19 19:15:01' 1 2 8
'2018-10-19 19:15:02' 1 3 9
'2018-10-19 18:10:22' 2 4 10
'2018-10-19 18:10:23' 2 5 11
'2018-10-19 18:10:24' 2 6 12"
allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
Create data.table:
setDT(allData)
Parse the timestamp and floor it to the minute:
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
Change the type of the integer columns to numeric:
allData[, ':='(data1 = as.numeric(data1),
               data2 = as.numeric(data2))]
Replace the data columns with their means, grouping by nightNo and the floored timestamp (grouping by nightNo alone only works here because each night in this sample spans a single minute):
allData[, ':='(data1 = mean(data1),
               data2 = mean(data2)),
        by = .(nightNo, timestamp)]
The result is:
timestamp nightNo data1 data2
1: 2018-10-19 19:15:00 1 2 8
2: 2018-10-19 19:15:00 1 2 8
3: 2018-10-19 19:15:00 1 2 8
4: 2018-10-19 18:10:00 2 5 11
5: 2018-10-19 18:10:00 2 5 11
6: 2018-10-19 18:10:00 2 5 11
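Since := keeps one row per second, a variant that collapses to one row per night and minute may be closer to what the question asks for (a sketch building on the floored timestamp above):
# one row per nightNo/minute instead of overwriting columns in place
allData[, lapply(.SD, mean), .SDcols = c("data1", "data2"),
        by = .(nightNo, timestamp)]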

Related

Joining two data frames on the closest date in R

I have two datasets that I would like to join based on date. One is a survey dataset, and the other is a list of prices at various dates. The dates don't match exactly, so I would like to join on the nearest date in the survey dataset (the price data is weekly).
Here's a brief snippet of what the survey dataset looks like (there are many other variables, but these are the two most relevant):
        ID actual.date
  20120377  2012-09-26
2020455822  2020-11-23
  20126758  2012-10-26
  20124241  2012-10-25
2020426572  2020-11-28
And here's the price dataset (also much larger, but you get the idea):
      date       price.var1       price.var2
2017-10-30 2.74733926399869 2.73994826674735
2015-03-16 2.77028200438506 2.74079930272231
2010-10-18  3.4265947805337 3.41591263539176
2012-10-29 4.10095806545397 4.14717556976502
2012-01-09 3.87888859352037 3.93074237884497
What I would like to do is join the price dataset to the survey dataset, joining on the nearest date.
I've tried a number of different things, none of which have worked to my satisfaction.
# reading in sample data
library(data.table)
library(dplyr)
# the "1:"-style row numbers land in column V1, hence select(-V1)
survey <- fread("ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
You were almost there.
In the DT[i, on] syntax, i should be survey, so that the join keeps all of survey's rows:
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
Another option: convert the dates to numeric, find the closest survey date for each price date with Closest() from DescTools, and take that value.
Example datasets (with the libraries loaded up front):
library(tidyverse)
library(lubridate)
library(DescTools)
survey <- tibble(
  ID = sample(20000:40000, 9, replace = TRUE),
  actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
                    "week")
)
price <- tibble(
  date = seq(today(), today() %m+% months(2), by = "week"),
  price_1 = sample(2:6, 9, replace = TRUE),
  price_2 = sample(2:6, 9, replace = TRUE)
)
survey
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
price <- price %>%
  mutate(date = Closest(survey$actual.date %>% as.numeric,
                        date %>% as.numeric) %>%
           as_date())
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6
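A design note: this rewrites the price dates in place, so it relies on the two tables pairing off one-to-one by nearest date. When several survey dates share a nearest price date, or the tables differ in length, the rolling join in the previous answer is the safer tool.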

Prices returns calculation in a df with many tickers with dplyr

I have a dataframe with 3 columns: Dates, Tickers (i.e. financial instruments), and Prices.
I just want to calculate the returns for each ticker.
Some data to play with:
AsofDate = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                     "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                     "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"))
Tickers = c("Ticker1", "Ticker1", "Ticker1", "Ticker1", "Ticker1",
            "Ticker2", "Ticker2", "Ticker2", "Ticker2", "Ticker2",
            "Ticker3", "Ticker3", "Ticker3", "Ticker3", "Ticker3")
Prices = c(1, 2, 7, 4, 2,
           6, 5, 7, 9, 12,
           11, 11, 16, 14, 15)
df = data.frame(AsofDate, Tickers, Prices)
My first idea was just to order the prices by (Tickers, Prices), run the calculation over the whole vector, and set the first day to NA...
TTR::ROC(x = Prices)
That ordering trick works in Excel, but I want something prettier in R.
So I tried something like this:
require(dplyr)
ret = df %>%
  select(Tickers, Prices) %>%
  group_by(Tickers) %>%
  do(data.frame(LogReturns = TTR::ROC(x = Prices)))
df$LogReturns = ret$LogReturns
But here I get too many values; it seems the calculation is not done by ticker.
Can you give me a hint? Thanks!
In dplyr, we can use lag to get previous Prices
library(dplyr)
df %>%
  group_by(Tickers) %>%
  mutate(returns = (Prices - lag(Prices)) / Prices)
# AsofDate Tickers Prices returns
# <date> <fct> <dbl> <dbl>
# 1 2018-01-01 Ticker1 1 NA
# 2 2018-01-02 Ticker1 2 0.5
# 3 2018-01-03 Ticker1 7 0.714
# 4 2018-01-04 Ticker1 4 -0.75
# 5 2018-01-05 Ticker1 2 -1
# 6 2018-01-01 Ticker2 6 NA
# 7 2018-01-02 Ticker2 5 -0.2
# 8 2018-01-03 Ticker2 7 0.286
# 9 2018-01-04 Ticker2 9 0.222
#10 2018-01-05 Ticker2 12 0.25
#11 2018-01-01 Ticker3 11 NA
#12 2018-01-02 Ticker3 11 0
#13 2018-01-03 Ticker3 16 0.312
#14 2018-01-04 Ticker3 14 -0.143
#15 2018-01-05 Ticker3 15 0.0667
In base R, we can use ave with diff
df$returns <- with(df, ave(Prices, Tickers,FUN = function(x) c(NA,diff(x)))/Prices)
We can use data.table
library(data.table)
setDT(df)[, returns := (Prices - shift(Prices))/Prices, by = Tickers]
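If you specifically want the TTR::ROC log returns from the question, the same grouped-mutate pattern applies (a sketch; ROC's default type = "continuous" gives log returns, with NA for each ticker's first day):
library(dplyr)
library(TTR)
# log returns per ticker; the first observation of each ticker is NA
df %>%
  group_by(Tickers) %>%
  mutate(LogReturns = ROC(Prices)) %>%
  ungroup()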

How to generate a unique ID for each group based on relative date interval in R using dplyr?

I have a cohort of data with multiple visits per person and want to group visits under a common ID based on person # and the time of the visit. The condition: if a visit's start is within 24 hours of the previous exit, those visits should share the same ID.
A sample of what the data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"),
                               format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"),
                                   format = "%Y-%m-%d %H:%M"),
  Visit_ID = 1:9)
This is what I tried to start with:
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
This is almost right, except it does not give the correct ID for visit 7 (person #3): that person has three visits, the second lies entirely within the first, and the third starts within 24 hours of the first but not the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(tidyr)
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
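As an aside, gather() and spread() are superseded in tidyr >= 1.0.0; if you prefer the newer API, the two reshape steps map onto pivot_longer()/pivot_wider() (a sketch of just those steps, not the full pipeline):
library(tidyr)
# replaces gather(event, time, Admit_Date_Time:Discharge_Date_Time)
dat_long <- pivot_longer(dat, Admit_Date_Time:Discharge_Date_Time,
                         names_to = "event", values_to = "time")
# and spread(event, time) at the end becomes:
# pivot_wider(names_from = event, values_from = time)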
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
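For comparison, a shorter dplyr sketch that handles the nested-visit case directly by tracking the latest discharge seen so far with cummax() (my own variant, not from either answer; the 24-hour window and column names follow the question):
library(dplyr)
dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(
    # latest discharge seen so far (cummax drops the POSIXct class, so restore it)
    latest_discharge = as.POSIXct(cummax(as.numeric(Discharge_Date_Time)),
                                  origin = "1970-01-01"),
    # new episode when there is no prior visit, or the admit falls more than
    # 24 hours after every discharge seen so far
    new_episode = row_number() == 1 |
      Admit_Date_Time > lag(latest_discharge) + 24 * 3600
  ) %>%
  ungroup() %>%
  mutate(Hosp_ID = cumsum(new_episode)) %>%
  select(-latest_discharge, -new_episode)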

R: Create a New Column in R to determine Semester Based on Two Dates

I have some data, ID and date, and I'm trying to create a new field for semester.
df:
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501
I also have a semester table:
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20190804 Summer-2019
I'd like to create a new field in df: if df$date is between semester$start and semester$end, then place the respective semester$season_year value in df.
I checked whether the lubridate package could help, but it seems geared more toward date calculations.
I saw this question, and it seems the closest to what I want, but, to make things more complicated, not all of our semesters are six months long.
Does this work?
library(lubridate)
semester$start <- ymd(semester$start)
semester$end <- ymd(semester$end)
df$date <- ymd(df$date)
LU <- Map(`:`, semester$start, semester$end)
LU <- data.frame(value = unlist(LU),
                 index = rep(seq_along(LU), lengths(LU)))
df$semester <- semester$season_year[LU$index[match(df$date, LU$value)]]
A solution using a non-equi update join with the data.table package:
library(data.table)
setDT(df)
setDT(semester)
df[, date := as.IDate(as.character(date), format = "%Y%m%d")]
semester[, ':='(start = as.IDate(as.character(start), format = "%Y%m%d"),
                end = as.IDate(as.character(end), format = "%Y%m%d"))]
df[semester, on = .(date >= start, date <= end), season_year := i.season_year]
df
# id date season_year
# 1: 1 2016-08-22 Fall-2016
# 2: 2 2017-01-09 Winter-2017
# 3: 3 2017-08-28 Fall-2017
# 4: 4 2017-09-25 Fall-2017
# 5: 5 2018-01-08 Winter-2018
# 6: 6 2018-04-02 Spring-2018
# 7: 7 2016-07-11 Summer-2016
# 8: 8 2015-08-31 Fall-2015
# 9: 9 2016-01-11 Winter-2016
# 10: 10 2016-05-02 Spring-2016
# 11: 11 2016-08-29 Fall-2016
# 12: 12 2017-01-09 Winter-2017
# 13: 13 2017-05-01 Spring-2017
Data:
df <- read.table(text="
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501",
header = TRUE, stringsAsFactors = FALSE)
semester <- read.table(text="
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20190804 Summer-2019",
header = TRUE, stringsAsFactors = FALSE)
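Since the semester table is sorted by start date and the intervals are contiguous, a base R findInterval() lookup is a compact alternative (a sketch, not from either answer, assuming date, start, and end have already been converted to Date/IDate as above):
# each date is assigned the last semester whose start is <= that date;
# valid because the table is sorted by start with no gaps between semesters
df$season_year <- semester$season_year[findInterval(df$date, semester$start)]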

Count how many cases exist per week given start and end dates of each case [closed]

I'm new here, so I apologize if I miss any conventions.
I have a ~2000 row dataset with data on unique cases happening in a three year period. Each case has a start date and an end date. I want to be able to get a new dataframe that shows how many cases occur per week in this three year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem could be solved more easily with the sqldf package, but I decided to stick with dplyr.
The approach:
library(dplyr)
library(lubridate)
# First create a data frame having all weeks from chosen start date to end date.
# 2015-01-01 to 2017-12-31
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
                                      as.Date("2017-12-31"), by = 7))
df_week <- df_week %>%
  mutate(weekEnd = weekStart + 7,
         weekNum = as.character(weekStart, "%V-%Y"),
         dummy = TRUE)
# The dummy column is only for joining purpose.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter and then group on week to find number of cases
# in each week
df_week %>%
  left_join(df, by = "dummy") %>%
  select(-dummy) %>%
  filter((weekStart >= Start_Date & weekStart <= End_Date) |
           (weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
  group_by(weekStart, weekEnd, weekNum) %>%
  summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
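A hedged data.table alternative avoids the cartesian dummy join by using a non-equi join. It applies the full overlap condition (Start_Date <= weekEnd and End_Date >= weekStart), which also catches a case that starts and ends inside a single week; the filter above only tests the week's two endpoints:
library(data.table)
setDT(df)
setDT(df_week)
# join each case to every week whose range it overlaps, then count per week;
# weeks with no active cases are dropped, as in the dplyr result above
res <- df[df_week,
          on = .(Start_Date <= weekEnd, End_Date >= weekStart),
          nomatch = 0L,
          .(weekStart = i.weekStart, weekNum = i.weekNum, ID = x.ID)]
res[, .(cases = .N), by = .(weekStart, weekNum)][order(weekStart)]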
Welcome to SO!
Before solving the problem, make sure the needed packages are installed by running
install.packages(c("readr", "dplyr", "lubridate"))
if you haven't installed them yet.
I'll present a modern R solution next; those packages do the heavy lifting.
This is a way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_delim(raw_data, delim = " ") %>% # the sample above is space-delimited
  mutate(start_date = as.Date(start_date)) %>% # parse the yyyy-mm-dd start dates
  mutate(weeks_lapse = as.integer((start_date - min(start_date)) / dweeks(1))) # whole weeks since the earliest start date
curated_data %>%
  group_by(weeks_lapse) %>% # group cases by the week their start date falls in
  summarise(cases_per_week = n()) # count how many cases start in each week
And the solution is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3
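Note that the two answers count different things: the first counts how many cases are active during each calendar week, while this one counts how many new cases start in each week, measured in weeks since the earliest start date. Pick whichever matches your definition of "cases per week".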
